Support Vector Machines

Department of Computer Science
Diploma Thesis
Support Vector Machines in Digital Pattern Recognition
Carried out at Siemens VDO in Regensburg
Submitted by:
Christian Miklos
St.-Wolfgangstrasse 11
93051 Regensburg
Advisor:
Reinhard Rösl
First examiner:
Prof. Jürgen Sauer
Second examiner:
Prof. Dr. Herbert Kopp
Submission date:
03.03.2004
Acknowledgements
This work was written as my diploma thesis in computer science at the University of Applied Sciences Regensburg, Germany, under the supervision of Prof. Dr. Jürgen Sauer. The research was carried out at Siemens VDO in Regensburg, Germany.
In Reinhard Rösl I found a very competent advisor there, to whom I owe much for his assistance in all aspects of my work. Thank you very much!
For their help during the writing of this document I want to thank all colleagues at the department at Siemens VDO. I enjoyed the work there very much in every sense and learned a lot that will surely be useful in the upcoming years.
My special thanks go to Prof. Jürgen Sauer, who helped me with all questions arising during this work.
CONTENTS

ABSTRACT
NOTATIONS
0 INTRODUCTION

I AN INTRODUCTION TO THE LEARNING THEORY AND BASICS
1 SUPERVISED LEARNING THEORY
1.1 Modelling the Problem
2 LEARNING TERMINOLOGY
2.1 Risk Minimization
2.2 Structural Risk Minimization (SRM)
2.3 The VC Dimension
2.4 The VC Dimension of Support Vector Machines, Error Estimation and Generalization Ability
3 PATTERN RECOGNITION
3.1 Feature Extraction
3.2 Classification
4 OPTIMIZATION THEORY
4.1 The Problem
4.2 Lagrangian Theory
4.3 Duality
4.4 Kuhn-Tucker Theory

II SUPPORT VECTOR MACHINES
5 LINEAR CLASSIFICATION
5.1 Linear Classifiers on Linearly Separable Data
5.2 The Optimal Separating Hyperplane for Linearly Separable Data
5.2.1 Support Vectors
5.2.2 Classification of Unseen Data
5.3 The Optimal Separating Hyperplane for Linearly Non-Separable Data
5.3.1 1-Norm Soft Margin - or the Box Constraint
5.3.2 2-Norm Soft Margin - or Weighting the Diagonal
5.4 The Duality of Linear Machines
5.5 Vector/Matrix Representation of the Optimization Problem and Summary
5.5.1 Vector/Matrix Representation
5.5.2 Summary
6 NONLINEAR CLASSIFIERS
6.1 Explicit Mappings
6.2 Implicit Mappings and the Kernel Trick
6.2.1 Requirements for Kernels - Mercer's Condition
6.2.2 Making Kernels from Kernels
6.2.3 Some Well-known Kernels
6.3 Summary
7 MODEL SELECTION
7.1 The RBF Kernel
7.2 Cross Validation
8 MULTICLASS CLASSIFICATION
8.1 One-Versus-Rest (OVR)
8.2 One-Versus-One (OVO)
8.3 Other Methods

III IMPLEMENTATION
9 IMPLEMENTATION TECHNIQUES
9.1 General Techniques
9.2 Sequential Minimal Optimization (SMO)
9.2.1 Solving for Two Lagrange Multipliers
9.2.2 Heuristics for Choosing Which Lagrange Multipliers to Optimize
9.2.3 Updating the Threshold b and the Error Cache
9.2.4 Speeding up SMO
9.2.5 The Improved SMO Algorithm by Keerthi
9.2.6 SMO and the 2-Norm Case
9.3 Data Pre-Processing
9.3.1 Categorical Features
9.3.2 Scaling
9.4 Matlab Implementation and Examples
9.4.1 Linear Kernel
9.4.2 Polynomial Kernel
9.4.3 Gaussian Kernel (RBF)
9.4.4 The Impact of the Penalty Parameter C on the Resulting Classifier and the Margin

IV MANUALS, AVAILABLE TOOLBOXES AND SUMMARY
10 MANUAL
10.1 Matlab Implementation
10.2 Matlab Examples
10.3 The C++ Implementation for the Neural Network Tool
10.4 Available Toolboxes Implementing SVM
10.5 Overall Summary

LIST OF FIGURES
LIST OF TABLES
LITERATURE
STATEMENT

APPENDIX
A SVM - APPLICATION EXAMPLES
A.1 Hand-written Digit Recognition
A.2 Text Categorization
B LINEAR CLASSIFIERS
B.1 The Perceptron
C CALCULATION EXAMPLES
D SMO PSEUDO CODES
D.1 Pseudo Code of the Original SMO
D.2 Pseudo Code of Keerthi's Improved SMO
ABSTRACT
The Support Vector Machine (SVM) is a new and very promising classification technique developed by Vapnik and his group at AT&T Bell Laboratories. This new learning algorithm can be seen as an alternative training technique for Polynomial, Radial Basis Function and Multi-Layer Perceptron classifiers. Recently it has shown very good results in the pattern recognition field, such as hand-written character, digit or face recognition, and it has also proved reliable in text categorization. It is mathematically well founded and of great and growing interest in many new fields of research, such as bioinformatics.
NOTATIONS
$x$ - input vector (input during training, already labelled)
$z$ - input vector (input after training, to be classified)
$y$ - output: class of input $x$ ($z$)
$X$ - input space
$x^T$ - vector $x$ transposed
$\langle a, b \rangle$ - inner product of vectors $a$ and $b$ (dot product)
$\operatorname{sgn}(f(x))$ - the signum function: $+1$ if $f(x) \ge 0$ and $-1$ else
$l$ - training set size
$S$ - training set $\{(x_i, y_i) : i = 1 \ldots l\}$
$(w, b)$ - defines the hyperplane $H = \{x : \langle w, x \rangle + b = 0\}$
$\alpha_i$ - Lagrange multipliers
$\xi_i$ - slack variables (for linearly non-separable datasets)
$\gamma_i$ - margin of a single point $x_i$
$L_P, L_D$ - Lagrangian: primal and dual
$C$ - error weight
$K(a, b)$ - kernel function calculated with vectors $a$ and $b$
$K$ - kernel matrix ($K_{ij} = k(x_i, x_j)$)
SVM - Support Vector Machine
SV - support vectors
nsv - number of support vectors
rbf - radial basis functions
LM - learning machine
ERM - Empirical Risk Minimization
SRM - Structural Risk Minimization
Chapter 0
Introduction
In this work the rather new concept in learning theory, the Support Vector Machine, will be discussed in detail. The goal of this work is to give an insight into the methods used and to describe them in a way that a person without a deep mathematical background can understand, so that the gap between theory and practice can be closed. It is not the intention of this work to examine every aspect and algorithm available in this field of learning theory, but to understand how and why the technique works at all and why it attracts such growing interest at this time.
This work should lay the basics for understanding the mathematical background, for being able to implement the technique, and for further research into whether this technique is suitable for the intended purpose at all.
As a product of this work the Support Vector Machine will be implemented both in Matlab and C++. The C++ part will be a module integrated into the so-called "Neural Network Tool" already used in the department at Siemens VDO, which already implements the Polynomial and Radial Basis Function classifiers. This tool serves to evaluate suitable techniques for later integration into the lane recognition system for cars currently under development there.
Support Vector Machines for classification are a rather new concept in learning theory. Their origins reach back to the early 1960s (VAPNIK and LERNER 1963; VAPNIK and CHERVONENKIS 1964), but they stirred up attention only in 1995 with Vladimir Vapnik's book The Nature of Statistical Learning Theory [Vap95]. In the last few years Support Vector Machines have shown excellent performance in many real-world applications such as hand-written character recognition, image classification or text categorization.
But because many aspects of this theory are still under intensive research, the amount of introductory literature is very limited. The two books by Vladimir Vapnik (The Nature of Statistical Learning Theory [Vap95] and Statistical Learning Theory [Vap98]) present only a general high-level introduction to this field. The first tutorial purely on Support Vector Machines was written by C. Burges in 1998 [Bur98]. In the year 2000 CRISTIANINI and SHAWE-TAYLOR published An Introduction to Support Vector Machines [Nel00], which was the main source for this work.
All these books and papers give a good overview of the theory behind Support Vector Machines, but they do not give a straightforward introduction to application. This is where this work comes in.
This work is divided into four parts:
Part I gives an introduction to supervised learning theory and the ideas behind pattern recognition. Pattern recognition is the environment in which the Support Vector Machine will be used in this work. The final chapter of this part lays the mathematical basics for the arising optimization problem.
Part II then introduces the Support Vector Machine itself with its mathematical background. For a better understanding, classification will first be restricted to the two-class problem; later one can see that this is no limitation, because the method can easily be extended to the multi-class case. Here the long-studied kernel technique, which gives Support Vector Machines their superior power, will also be analysed in detail.
Part III then analyses implementation techniques for Support Vector Machines. It will be shown that there are many approaches for solving the arising optimization problem, but only the most used and best performing algorithms for large amounts of data will be discussed in detail.
Part IV, finally, is intended as a manual for the implementations done in Matlab and C++. A list of widely used toolboxes for Support Vector Machines, both in C/C++ and Matlab, will also be given.
Last but not least, the appendix states some real-world applications, some calculation examples for the arising mathematical problems, the rather simple Perceptron algorithm for classification, and the pseudo code used for the implementation.
8
Part I
An Introduction to the Learning Theory and Basics
Chapter 1
Supervised Learning Theory
When computers are applied to solve a practical problem, it is usually the case that the method of deriving the required output from a set of inputs can be described explicitly. But there are many cases where one wants the machine to perform tasks that cannot be described by an algorithm. Such tasks cannot be solved by classical programming techniques, since no mathematical model exists for them or the computation of the exact solution is very expensive (it could last for hundreds of years, even on the fastest processors). As examples, consider the problem of hand-written digit recognition (a classical problem of machine learning) or the detection of faces in a picture.
There is need for a different approach to solve such problems. Maybe the machine is teachable, as children are in school? That is, it is not given abstract definitions and theories by the teacher; instead, the teacher points out examples of the input-output functionality. Consider children learning the alphabet. The teacher does not give them precise definitions of each letter, but shows them examples. Thereby the children learn general properties of the letters by examining these examples. In the end these children will be able to read words in script style, even if they were taught only on printed types.
In more mathematical words, this observation leads to the concept of classifiers. The purpose of learning such a classifier from a few given examples, already correctly classified by the supervisor, is to be able to classify future unknown observations correctly.
But how can learning from examples, which is called supervised learning, be formalized mathematically so that it can be applied to a machine?
1.1 Modelling the Problem
Learning from examples can be described in a general model by the following elements: the generator of the input data x, the supervisor who assigns labels/classes y to the data for learning, and the learning machine that returns some answer y' hopefully close to that of the supervisor.
The labelled/pre-classified examples (x, y) are referred to as the training data. The input/output pairings typically reflect a functional relationship mapping the inputs to outputs, though this is not always the case, for example when the outputs are corrupted by noise. But when an underlying function exists, it is referred to as the target function. So the goal is the estimation of this target function, which is learnt by the learning machine and is known as the solution of the learning problem. In the case of classification problems, e.g. "this is a man and this is a woman", this function is also known as the decision function. The optimal solution is chosen from a set of candidate functions which map from the input space to the output domain. Usually a set or class of candidate functions is chosen, known as the hypotheses. As an example, consider so-called decision trees, which are hypotheses created by constructing a binary tree with simple decision functions at the internal nodes and output values at the leaves (the y-values).
A learning problem with binary outputs (0/1, yes/no, positive/negative, ...) is referred to as a binary classification problem, one with a finite number of categories as a multi-class classification problem, while for real-valued outputs the problem is known as regression. In this diploma thesis only the first two categories will be considered, although the later discussed Support Vector Machines can be "easily" extended to the regression case.
A more mathematical interpretation of this will be given now.
The generator above determines the environment in which the supervisor and the learning machine act. It generates the vectors $x \in \mathbb{R}^n$ independently and identically distributed according to some unknown probability distribution $P(x)$.
The supervisor assigns the "true" output values according to a conditional distribution function $P(y \mid x)$ (the output depends on the input). This includes the case $y = f(x)$, in which the supervisor associates a fixed $y$ with every $x$.
The learning machine then is defined by a set of possible mappings $x \mapsto f(x, \alpha)$, where $\alpha$ is an element of a parameter space. An example of a learning machine for binary classification is defined by oriented hyperplanes $\{x : \langle w, x \rangle + b = 0\}$, where $\alpha = (w, b) \in \mathbb{R}^{n+1}$ determines the position of the hyperplane in $\mathbb{R}^n$. As a result the following learning machine (LM) is obtained:

$$LM = \{ f(x, (w, b)) = \operatorname{sgn}(\langle w, x \rangle + b) \; ; \; (w, b) \in \mathbb{R}^{n+1} \}$$

(This type of learning machine will be described in detail in Part II, because Support Vector Machines implement this technique.)
The functions $f : \mathbb{R}^n \to \{-1, +1\}$, mapping the input $x$ to the positive ($+1$) or negative ($-1$) class, are called the decision functions. So this learning machine works as follows: the input $x$ is assigned to the positive class if $\langle w, x \rangle + b \ge 0$, and otherwise to the negative class.
Figure 1.1: Multiple possible decision functions in $\mathbb{R}^2$. They are defined by $\langle w, x \rangle + b = 0$ (for details see Part II of this work). Points to the left are assigned to the positive ($+1$) class, and the ones to the right to the negative ($-1$) class.
The above definition of a learning machine is called a Linear Learning Machine because of the linear nature of the function $f$ used here. Among all possible functions, the linear ones are the best understood and the simplest to apply. They provide the framework within which the construction of more complex systems is possible, as will be done in later chapters.
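To make the definition concrete, the following Matlab sketch evaluates such a linear learning machine for a hand-picked parameter $\alpha = (w, b)$ (the numbers are made up for illustration and are not learned from data):

% A minimal sketch of a linear learning machine in R^2: every input x is
% assigned to the class sgn(<w,x> + b). Parameters are chosen by hand here.
w = [1; -2];                % normal vector of the hyperplane
b = 0.5;                    % bias
X = [2 1; -1 3; 0 0];       % three input vectors, one per row
f = X*w + b;                % f(x) = <w,x> + b for every row of X
labels = sign(f);           % +1 -> positive class, -1 -> negative class
labels(f == 0) = 1;         % convention: f(x) >= 0 belongs to the positive class
disp([f labels])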
There is need for a choice of the parameter $\alpha$ based on $l$ observations (the training set):

$$S = \{ (x_1, y_1), \ldots, (x_l, y_l) \; ; \; x_i \in \mathbb{R}^n, \; y_i \in \{-1, +1\} \}$$
This is called the training of the machine. The training set S is drawn according to the distribution P(x, y).
If all this data is given to the learner (the machine) at the start of the learning phase, this is called batch learning. If instead the learner receives one example at a time and gives an estimate of the output before receiving the correct value, it is called online learning. In this work only batch learning is considered. Each of these two learning methods can further be subdivided into unsupervised learning and supervised learning.
Once a function appropriately mapping the input to the output has been chosen (learned), one wants to see how well it works on unseen data. Usually the available data is split into two parts: the labelled training set above and a so-called labelled test set. This test set is applied after training; knowing the expected output values, the results of the machine's classification are compared with the expected ones to obtain the error rate of the machine.
But simply verifying the quality of an algorithm in such a way is not enough. The goal of a gained hypothesis is not only to be consistent with the training set but also to work well on future data. There are also other problems inside the whole process of generating a verifiable consistent hypothesis. First, the function to be learnt may not have a simple representation and hence may not be easily verified in this way. Second, the training data is frequently noisy, so there is no guarantee that there is an underlying function which correctly maps the training data.
But the main problem arising in practice is the choice of the features. Features are the components the input vector x consists of. They have to describe the input data for classification in an "appropriate" way. Appropriate means, for example, little or no redundancy. Some hints on choosing a suitable representation for the data will be given in the upcoming chapters, but not in detail, because this would exceed the scope of this work.
As an example of the second problem, consider the classification of web pages into categories, which can never be an exact science. But such data is increasingly of interest for learning. So there is a need for measuring the quality of a classifier in some other way: good generalization.
The ability of a hypothesis/classifier to correctly classify data beyond the training set, or in other words, to make precise predictions by learning from few examples, is known as its generalization ability, and this is the property to be optimized.
Chapter 2
Learning Terminology
This chapter is intended to stress the main concepts arising from the theory of statistical learning [Vap79] and the VC theory [Vap95]. These concepts are the fundamentals of learning machines. Terms such as generalization ability and capacity will be described here.
2.1 Risk Minimization
As seen in the last chapter, the task of a learning machine is to infer general features from a set of labelled examples, the training set. The goal is to generalize from the training examples to the whole range of possible observations. Success is measured by the ability to correctly classify new unseen data not belonging to the training set. This is called the generalization ability.
But since training yields a set of possible hypotheses, there is need for some measure by which to choose the optimal one, which is the same as later measuring the generalization ability.
Mathematically this can be expressed using the risk function, a measure of quality based on the expected classification error of a trained machine. This expected risk (the test error) is the possible average error committed by the chosen hypothesis $f(x, \alpha)$ on unknown examples drawn randomly from the sample distribution $P(x, y)$:

$$R(\alpha) = \int \frac{1}{2} \, | y - f(x, \alpha) | \; dP(x, y) \qquad (2.1)$$

Here the function $\frac{1}{2} | y - f(x, \alpha) |$ is called the loss (the difference between the output expected by the supervisor and the response of the learning machine). $R(\alpha)$ is referred to as the risk function or simply the risk. The goal is to find parameters $\alpha^*$ such that $f(x, \alpha^*)$ minimizes the risk over the class of functions $f(x, \alpha)$. But since $P(x, y)$ is unknown, the value of the risk for a given parameter $\alpha$ cannot be computed directly. The only available information is contained in the given training set $S$.
So the empirical risk $R_{emp}(\alpha)$ is defined to be just the measured mean error rate on the training set of finite length $l$:

$$R_{emp}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} | y_i - f(x_i, \alpha) | \qquad (2.2)$$

Note that no probability distribution appears here, and $R_{emp}(\alpha)$ is a fixed number for a particular choice of $\alpha$ and a particular training set $S$.
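As a small illustration, the following Matlab sketch evaluates (2.2) for a fixed linear hypothesis on a toy training set (both are made up); note that $|y_i - f(x_i, \alpha)|$ is 0 for a correct and 2 for a wrong classification, so the factor $1/(2l)$ yields the mean error rate:

% Empirical risk (2.2) of a fixed hypothesis f(x) = sgn(<w,x> + b)
X = [1 1; 2 0; -1 -1; -2 1];   % training inputs, one per row
y = [1; 1; -1; -1];            % labels assigned by the supervisor
w = [1; 0.5]; b = 0;           % some fixed hypothesis
fx = sign(X*w + b); fx(fx == 0) = 1;
l = numel(y);
Remp = (1/(2*l)) * sum(abs(y - fx))   % fraction of misclassified points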
For further considerations, assume binary classification with outputs $y_i \in \{-1, +1\}$. Then the loss function can only produce the outputs 0 or 1. Now choose some $\eta$ such that $0 \le \eta \le 1$. Then for losses taking these values, with probability $1 - \eta$, the following bound holds [Vap95]:

$$R(\alpha) \le R_{emp}(\alpha) + \sqrt{ \frac{ h \, (\log(2l/h) + 1) - \log(\eta/4) }{ l } } \qquad (2.3)$$

where $h$ is a non-negative integer called the Vapnik-Chervonenkis (VC) dimension. It is a measure of the notion of capacity. The second summand of the right hand side is called the VC confidence.
Capacity is the ability of a machine to learn any training set without error. It is a measure of the richness or flexibility of the function class.
A machine with too much capacity tends to overfitting, whereas too low a capacity leads to errors on the training set. The most popular concept to describe the richness of a function class in machine learning is the Vapnik-Chervonenkis (VC) dimension.
Burges gives an illustrative example on capacity in his paper [Bur98]: “A
machine with too much capacity is like a botanist with a photographic
memory who, when presented with a new tree, concludes that it is not a
tree because it has a different number of leaves from anything he has
seen before. A machine with too little capacity is like the botanist’s lazy
brother, who declares that if it is green, it is a tree. Neither can generalize
well.”
To conclude this subchapter, three key points can be drawn about the bound (2.3): First, it is independent of the distribution P(x, y); it only assumes that the training and test data are drawn independently according to some distribution P(x, y). Second, it is usually not possible to compute the left hand side. Third, if h is known, it is easily possible to compute the right hand side.
The bound also shows that low risk depends both on the chosen class of functions (the learning machine) and on the particular function chosen by the learning algorithm, the hypothesis, which should be optimal. The bound decreases if a good separation of the training set is achieved by a learning machine with low VC dimension. This approach leads to the principle of structural risk minimization (SRM).
2.2 Structural Risk Minimization (SRM)
Let the entire class of functions $K = \{ f(x, \alpha) \}$ be divided into nested subsets of functions such that $K_1 \subset K_2 \subset \ldots \subset K_n \subset \ldots$. For each subset it must be possible to compute the VC dimension $h$, or at least to get a bound on it. SRM then consists of finding that subset of functions which minimizes the bound on the risk. This can be done by simply training a series of machines, one for each subset, where for a given subset the goal of training is simply to minimize the empirical risk. One then chooses that trained machine in the series whose sum of empirical risk and VC confidence is minimal.
So overall the approach works as follows: the confidence interval is kept fixed (by choosing particular $h$'s) and the empirical risk is minimized. In the neural network case this technique is adopted by first choosing an appropriate architecture and then eliminating classification errors. The second approach is to keep the empirical risk fixed (e.g. equal to zero) and to minimize the confidence interval. Support Vector Machines also implement the principles of SRM, by finding, among all canonical hyperplanes defined by $\min_i |\langle w, x_i \rangle + b| = 1$, the one which minimizes the norm $\|w\|^2$. (SVMs, hyperplanes, canonical hyperplanes and why the norm is minimized will be explained in detail in Part II of this work; here only a reference is given.)
2.3 The VC Dimension
The VC dimension is a property of a set of functions $\{f(\alpha)\}$ and can be defined for various classes of functions $f$. Here, again, only the functions corresponding to the two-class pattern recognition case with $y \in \{-1, +1\}$ are considered.
Definition 2.1 (Shattering)
If a given set of $l$ points can be labelled in all possible $2^l$ ways, and for each labelling a member of the set $\{f(\alpha)\}$ can be found which correctly assigns those labels (classifies), this set of points is said to be shattered by that set of functions.
Definition 2.2 (VC Dimension)
The Vapnik-Chervonenkis (VC) dimension of a set of functions is defined as the maximum number of training points that can be shattered by it. The VC dimension is infinite if $l$ points can be shattered by the set of functions, no matter how large $l$ is.
Note that if the VC dimension is $h$, then there exists at least one set of $h$ points that can be shattered, but in general it will not be true that every set of $h$ points can be shattered.
As an example, consider shattering points with oriented hyperplanes in $\mathbb{R}^n$. As an introduction, assume the data lives in $\mathbb{R}^2$ and the set of functions $\{f(\alpha)\}$ consists of oriented straight lines, so that for a given line all points on one side are assigned to the class $+1$ and all points on the other side to the class $-1$. The orientation is indicated in the following figures by an arrow, specifying the side on which the points of class $+1$ lie. While it is possible to find three points that can be shattered by this set of functions (figure 2.1), it is not possible to find four. Thus the VC dimension of the set of oriented lines in $\mathbb{R}^2$ is three.
Without proof (it can be found in [Bur98]) it can be stated that the VC dimension of the set of oriented hyperplanes in $\mathbb{R}^n$ is $n+1$.
Figure 2.1: Three points not lying on a line can be shattered by oriented hyperplanes in $\mathbb{R}^2$. The arrow points in the direction of the positive examples (black). In contrast, four points can be found in $\mathbb{R}^2$ which cannot be shattered by oriented hyperplanes.
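The shattering of three points can even be verified mechanically. The following Matlab sketch trains a simple perceptron (see appendix B) on all $2^3 = 8$ labellings of three non-collinear points and checks that each labelling is separated; the points and the iteration count are chosen arbitrarily:

% Verify that three non-collinear points in R^2 are shattered by oriented lines
X  = [0 0; 1 0; 0 1];            % three points, not on a common line
Xa = [X ones(3,1)];              % augmented inputs absorbing the bias
for code = 0:7
    y = (2*(dec2bin(code, 3) == '1') - 1)';   % one of the 8 labellings
    w = zeros(3, 1);
    for epoch = 1:100                         % perceptron update rule
        for i = 1:3
            if y(i)*(Xa(i,:)*w) <= 0
                w = w + y(i)*Xa(i,:)';
            end
        end
    end
    fprintf('labelling %d shattered: %d\n', code, all(y .* (Xa*w) > 0));
end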
2.4 The VC Dimension of Support Vector Machines, Error Estimation and Generalization Ability
It should be said first that this subchapter does not claim completeness in any sense. No proofs will be given for the stated conclusions, and the contents are only excerpts of the theory, since the full theory is beyond the intention of this work. The interested reader can refer to the books on statistical learning theory [Vap79], VC theory [Vap95] and many other works on this. Here only some parts of the whole theory that are important for Support Vector Machines will be shown.
Imagine points in $\mathbb{R}^2$ which are to be binary classified: class $+1$ or class $-1$. They are consistent with many classifiers (hypotheses, sets of functions). But how can one reduce the room for hypotheses? One approach is to apply a margin to each data point (figures 1.1 and 2.2); then, the broader that margin, the smaller the room for hypotheses. This approach is justified by Vapnik's learning theory.
Figure 2.2: Reducing the room for hypotheses by applying a margin to each point
The main conclusion of this technique is that a wide margin often leads to a good generalization ability, but can restrict flexibility in some cases. Therefore the maximal margin approach introduced later for Support Vector Machines is a practicable way, and this technique means that Support Vector Machines implement the principles of structural risk minimization.
Vapnik [Vap95] also bounded the actual risk of Support Vector Machines in an alternative way. The term support vectors will be explained in Part II of this work; the bound is only stated here. It is very general, and one can often observe that it behaves in the other direction: few support vectors, but a high bound.
$$E[P(error)] \le \frac{E[\text{number of support vectors}]}{\text{number of training samples}} \qquad (2.4)$$

where $P(error)$ is the risk for a machine trained on $l-1$ examples, $E[P(error)]$ is the expectation over all choices of training sets of size $l-1$, and $E[\text{number of support vectors}]$ is the expectation of the number of support vectors over all choices of training sets of size $l-1$.
To end this subchapter, some known VC dimensions of the later introduced Support Vector Machines should be stated, but without proof: Support Vector Machines implementing Gaussian kernels (see chapter 6) have infinite VC dimension, and the ones using polynomial kernels of degree $p$ have a VC dimension of $\binom{p + d_l - 1}{p} + 1$, where $d_l$ is the dimension in which the data lives, e.g. $\mathbb{R}^2$, and $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ is the binomial coefficient. So here the VC dimension is finite but grows rapidly with the degree.
Compared with the bound (2.3) this result is disappointing, because with Gaussian kernels the VC dimension is infinite and the bound becomes useless.
But due to new developments in generalization theory, even the use of function classes with infinite VC dimension becomes practicable. The main theory is about maximal margin bounds and gives another bound on the risk, which is applicable even in the infinite case. It works with a new analysis method in contrast to the VC dimension: the fat-shattering dimension.
Looking ahead: the generalization performance of Support Vector Machines is excellent in contrast to other long-studied methods, e.g. classification based on Bayesian theory. As this is beyond this work, only a reference is given here: the paper "Generalization Performance of Support Vector Machines and Other Pattern Classifiers" by Bartlett and Shawe-Taylor (1999).
Now the theoretical groundwork has been laid for looking into Support Vector Machines and into why they work at all.
Chapter 3
Pattern Recognition
Figure 3.1: Computer vision: image processing and pattern recognition. The whole problem is split into sub-problems to make it manageable.
Pattern recognition belongs to the field of computer vision. Computer vision tries to "teach" a machine the human ability of perceiving and understanding its environment. The main problem arising here is the representation of the three-dimensional environment by two-dimensional sensors.
Definition 3.1 (Pattern recognition)
Pattern recognition is the theory of the best possible assignment of an unknown pattern or observation to a meaning class (classification). In other words: the process of identifying objects with the help of already learned examples.
So the purpose of a pattern recognition program is to analyze a scene (mostly in the real world, with the aid of an input device such as a camera for digitization) and to arrive at a description of the scene which is useful for the accomplishment of some task, e.g. face detection or hand-written digit recognition.
3.1 Feature Extraction
Feature extraction comprises the procedures for measuring the relevant shape information contained in a pattern, so that the task of classifying the pattern is made easy by a formal procedure. For example, in character recognition a typical feature might be the height-to-width ratio of a letter. Such a feature will be useful for differentiating between a W and an I, but for distinguishing between E and F this feature would be quite useless. So more features are necessary, or the one given above has to be replaced by another. The goal of feature extraction is to find as few features as possible that adequately differentiate the patterns of a particular application into their corresponding pattern classes. The more features there are, the more complicated the task of classification can become, because the degree of freedom (the dimension of the vectors) grows, and for each new feature introduced one usually needs some hundreds of new training points to obtain reliable statistics. To give a link to Support Vector Machines here: feature extraction is the main problem in practice, because of the proper selection one has to define (avoiding redundancy) and because of the amount of training data one has to create for each new feature introduced.
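As a small illustration of such a feature, the following Matlab sketch computes the height-to-width ratio of a character from a binary image (the glyph is made up; 1 stands for ink):

% Height-to-width ratio of a character in a binary image as a single feature
img = zeros(16, 16); img(3:14, 7:9) = 1;   % a hypothetical, roughly 'I'-shaped glyph
rows = any(img, 2); cols = any(img, 1);    % occupied rows and columns
height = find(rows, 1, 'last') - find(rows, 1, 'first') + 1;
width  = find(cols, 1, 'last') - find(cols, 1, 'first') + 1;
feature = height / width                   % large for an 'I', close to 1 for a 'W'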
3.2 Classification
The step of classification is concerned with making decisions about the class membership of a pattern in question. The task is to design a decision rule that is easy to compute and that minimizes the probability of misclassification. The classifier chosen to fulfill this step has to be trained with already classified examples to obtain the optimal decision rule, because when dealing with highly complex classes the classifier will not be describable as a linear one.
Figure 3.2: Development steps of a classifier.
As an example, consider the distinction between apples and pears. Here the a-priori knowledge is that pears are higher than they are broad, and apples are broader than they are high. So one feature would be the height-to-width ratio. Another feature that could be chosen is the weight. After measuring some examples, the picture of figure 3.3 is obtained. As can be seen, the classifier can almost be approximated by a linear one (the horizontal line). Other problems may consist of more than two classes, and the classes may overlap, so some error-tolerating scheme and the use of nonlinearity are needed.
There are two ways of training a classifier:
- supervised learning
- unsupervised learning
The technique of supervised learning uses a representative sample, i.e. one that describes the classes very well. The sample leads to a classification which should approximate the real classes in feature space; the separation boundaries are computed there.
In contrast to this, unsupervised learning uses algorithms which analyze the grouping tendency of the feature vectors into point clouds (clustering). Simple algorithms are e.g. minimum distance classification, the maximum likelihood classifier or classifiers based on Bayesian theory.
Figure 3.3: Training a classifier on the two-class problem of distinguishing between apples and pears by the use of two features (weight and height-to-width ratio).
Chapter 4
Optimization Theory
As we have seen in the first two chapters, the learning task may be formulated as an optimization problem: the hypothesis function sought should be chosen such that the risk function is minimized. Typically this optimization problem will be subject to some constraints.
Later we will see that in the support vector theory we are only concerned with the case in which the function to be minimized/maximized, called the cost function, is a convex quadratic function, while the constraints are all linear. The known methods for solving such problems are called convex quadratic programming.
In this chapter we will take a closer look at the Lagrangian theory, which is the most suitable way to solve such optimization problems with many variables. Furthermore the concept of duality will be introduced, which plays a major role in the theory of Support Vector Machines.
The Lagrangian theory was first introduced in 1797 and was only able to deal with functions constrained by equalities. In 1951 this theory was extended by Kuhn and Tucker to the case of inequality constraints. Nowadays this extension is known as the Kuhn-Tucker theory.
4.1 THE PROBLEM
The general optimization problem can be written as a minimization problem, since reversing the sign of the function to be optimized turns it into the equivalent maximization problem.
Definition 4.1 (Primal Optimization Problem)
Given functions $f$, $g_i$ and $h_j$ defined on a domain $\Omega \subseteq \mathbb{R}^n$, the problem can be formulated as:

Minimize $f(x)$, $x \in \Omega$
subject to $g_i(x) \le 0$, $i = 1, \ldots, k$
and $h_j(x) = 0$, $j = 1, \ldots, m$

where $f(x)$ is called the objective function, the $g_i$ the inequality constraints and the $h_j$ the equality constraints. The optimal value of the function $f$ is known as the value of the optimization problem.
An optimization problem is called a linear program if the objective function and all constraints are linear, and a quadratic program if the objective function is quadratic while the constraints remain linear.
Definition 4.2 (Standard Linear Optimization Problem)

Minimize $c^T x$
subject to $Ax = b$, $x \ge 0$

Reformulated, this means:

Minimize $c_1 x_1 + \ldots + c_n x_n$
subject to $a_{11} x_1 + \ldots + a_{1n} x_n = b_1$, ..., $a_{n1} x_1 + \ldots + a_{nn} x_n = b_n$, and $x \ge 0$

Another representation is:

Minimize $\sum_i c_i x_i$
subject to $\sum_j a_{ij} x_j = b_i$ with $i = 1 \ldots n$, and $x \ge 0$
It is possible to rewrite every common linear optimization problem in this standard form, even if the constraints are given as inequalities. For further reading on this topic, refer to one of the many good textbooks on optimization. There are many ways to obtain the solution(s) of linear problems, e.g. Gaussian reduction or the simplex method, but we will not encounter such problems and therefore do not discuss these techniques here.
Definition 4.3 (Standard Quadratic Optimization Problem)

Minimize $c^T x + x^T D x$
subject to $Ax \le b$, $x \ge 0$

with the matrix $D$ positive (semi-)definite, so that the objective function is convex. Semi-definite means that $x^T D x \ge 0$ for every $x$ (in other words, $D$ has non-negative eigenvalues). Non-convex functions and domains are not discussed here, because they do not play any role in the algorithms for Support Vector Machines; for further reading on nonlinear optimization, refer to [Jah96]. Note that in this problem the variables appear both in the form $x$ and $x^2$, which does not lead to a linear system, where only the form $x$ is found.
Definition 4.4 (Convex domains)
A domain $D \subseteq \mathbb{R}^n$ is convex if for any two points $x, y \in D$ the line segment connecting them is also contained in $D$. Mathematically this means:

$$(1-h)x + hy \in D \quad \text{for all } h \in [0,1] \text{ and } x, y \in D$$
For example, $\mathbb{R}^2$ is a convex domain. In figure 4.1 only the three upper domains are convex.
Figure 4.1: Three convex and two non-convex domains
Definition 4.5 (Convex functions)
A function $f$ is said to be convex on $D \subseteq \mathbb{R}^n$ if the domain $D$ is convex and for all $x, y \in D$ and $h \in [0,1]$:

$$f(hx + (1-h)y) \le h f(x) + (1-h) f(y)$$

In words this means that the graph of the function always lies below the secant (or chord).
Figure 4.2: Convex and concave functions
Another criterion for the convexity of twice differentiable functions is the positive semi-definiteness of the Hessian matrix [Jah96].
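In practice this criterion is easy to check numerically; a minimal Matlab sketch for the constant Hessian D of a quadratic objective as in definition 4.3:

% Convexity test for a quadratic objective via the eigenvalues of its Hessian D
D = [2 -1; -1 2];                   % example matrix, eigenvalues 1 and 3
isConvex = all(eig(D) >= -1e-12)    % all eigenvalues non-negative (up to round-off)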
The problem of minimizing a convex function on a convex domain (set) is known as a convex programming problem. The main advantage of such problems is that every local solution of a convex problem is also a global solution, and that a global solution is unique there. In nonlinear, non-convex problems the main problem is local minima. For example, algorithms implementing the gradient descent (or ascent) method to find a minimum (maximum) of the objective function cannot guarantee that a found minimum is a global one, and so the solution would not be optimal.
In the rest of this diploma thesis and in the support vector theory, the optimization problem can be restricted to the case of a convex quadratic function with linear constraints on a domain $\Omega \subseteq \mathbb{R}^n$.
Figure 4.4: A local minimum of a nonlinear and non-convex function
4.2 LAGRANGIAN THEORY
The intention of the Lagrangian theory is to characterize the solution of an optimization problem, initially in the absence of inequality constraints. Later the method was extended to the presence of inequality constraints, which is known as the Kuhn-Tucker theory.
To ease understanding, we first introduce the simplest case: optimization in the absence of any constraints.
Theorem 4.6 (Fermat)
A necessary condition for $w^*$ to be a minimum of $f(w)$, $f \in C^1$, is that the first derivative vanishes:

$$\frac{\partial f(w^*)}{\partial w} = 0$$

This condition is also sufficient if $f$ is a convex function.
Addition: a point $x_0 = (x_1, \ldots, x_n) \in \mathbb{R}^n$ satisfying this condition is called a stationary point of the function $f: \mathbb{R}^n \to \mathbb{R}$.
To use this on constrained problems, a function known as the Lagrangian is defined, which unites information about both the objective function and its constraints. The stationarity of the Lagrangian can then be used to find solutions.
In appendix C you can find a graphical solution to such a problem in two variables and the calculated Lagrangian solution to the same problem; an example for the general case is formulated there as well.
Definition 4.7 (Lagrangian)
Given an optimization problem with objective function $f(w)$ and the equality constraints $h_1(w) = c_1, \ldots, h_n(w) = c_n$, the Lagrangian function is defined as

$$L(w, \alpha) = f(w) + \sum_{i=1}^{n} \alpha_i \, (c_i - h_i(w))$$

As every equality constraint can be transformed into the form $h_i(w) = 0$, the Lagrangian becomes

$$L(w, \alpha) = f(w) + \sum_{i=1}^{n} \alpha_i h_i(w)$$

The coefficients $\alpha_i$ are called the Lagrange multipliers.
Theorem 4.8 (Lagrange)
A necessary condition for a point $w^* \in \mathbb{R}^n$ to be a minimum (solution) of the objective function $f(w)$ subject to $h_i(w) = 0$, $i = 1 \ldots n$, with $f, h_i \in C^1$, is

$$\frac{\partial L(w^*, \alpha^*)}{\partial w} = 0 \qquad \text{(derivative with respect to } w)$$

$$\frac{\partial L(w^*, \alpha^*)}{\partial \alpha} = 0 \qquad \text{(derivative with respect to } \alpha)$$

These conditions are also sufficient if $L(w, \alpha)$ is a convex function; the solution is then a global optimum.
The conditions provide a system of $n + m$ equations, the last $m$ of which are the equality constraints (see appendix C for examples). By solving this system one obtains the solution.
Note: at the optimal point the constraints equal zero, and so the value of the Lagrangian equals the objective function: $L(w^*, \alpha^*) = f(w^*)$.
For an interpretation of the Lagrange multiplier $\alpha$ of the function $f(w) + \alpha (c - h(w))$, we regard the Lagrangian as a function of $c$ and differentiate with respect to $c$:

$$\frac{\partial L}{\partial c} = \alpha$$

But in the optimum $L(w, \alpha) = f(w^*)$. So we can interpret the Lagrange multiplier $\alpha$ as a hint on how the optimum changes if the constant $c$ of the constraint $h(w) = c$ is changed.
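A minimal worked example (the problem is made up for illustration): minimize $f(w) = w_1^2 + w_2^2$ subject to $h(w) = w_1 + w_2 = 1$, i.e. $c = 1$:

$$L(w, \alpha) = w_1^2 + w_2^2 + \alpha (1 - w_1 - w_2)$$

$$\frac{\partial L}{\partial w_1} = 2w_1 - \alpha = 0, \qquad \frac{\partial L}{\partial w_2} = 2w_2 - \alpha = 0, \qquad \frac{\partial L}{\partial \alpha} = 1 - w_1 - w_2 = 0$$

$$\Rightarrow \; w_1^* = w_2^* = \tfrac{1}{2}, \quad \alpha^* = 1, \quad f(w^*) = \tfrac{1}{2}$$

With the constraint constant changed to $c$, the optimum becomes $f(w^*) = c^2/2$, whose derivative with respect to $c$ at $c = 1$ is $1 = \alpha^*$, consistent with the interpretation above.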
Now to the most general case, where the optimization problem contains both equality and inequality constraints.
Definition 4.9 (Generalized Lagrangian Function)
The general optimization problem can be stated as:

Minimize $f(w)$
subject to $g_i(w) \le 0$, $i = 1 \ldots k$ (inequalities)
and $h_j(w) = 0$, $j = 1 \ldots m$ (equalities)

Then the generalized Lagrangian is defined as:

$$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{j=1}^{m} \beta_j h_j(w) = f(w) + \alpha^T g(w) + \beta^T h(w)$$
4.3 DUALITY
The introduction of dual variables is a powerful tool, because the alternative, dual, formulation of an optimization problem often turns out to be easier to solve than its so-called primal problem: the handling of the inequality constraints in the primal (which are often found there) is very difficult. The dual problem of a primal problem is obtained by introducing the Lagrange multipliers, also called the dual variables. The dual function then no longer depends on the primal variables, and solving this problem is the same as solving the primal one. The new dual variables are considered to be the fundamental unknowns of the problem. Duality is also a common procedure in linear optimization problems; for further reading see [Mar00]. In general the primal minimization problem is turned into a dual maximization one. At the optimal solution point the primal and the dual function meet, both having an extremum there (convex functions have only one global extremum).
Here we only look at the duality method important for Support Vector Machines. To transform the primal problem into its dual, two steps are necessary. First, the derivatives of the primal Lagrangian with respect to the primal variables are set to zero. Second, the relations so gained are substituted back into the Lagrangian. This removes the dependency on the primal variables and corresponds to explicitly computing the new function

$$\theta(\alpha, \beta) = \inf_{w \in \Omega} L(w, \alpha, \beta)$$

For a proof of this see [Nel00]. So overall the primal minimization problem of definition 4.1 can be transformed into the following dual problem:
Definition 4.10 (Lagrangian Dual Problem)

Maximize $\theta(\alpha, \beta) = \inf_{w \in \Omega} L(w, \alpha, \beta)$
subject to $\alpha \ge 0$

(The infimum of a subset of a linearly ordered set is its greatest lower bound; in particular, the infimum of a set of numbers is the largest number that is less than or equal to every number in the set.) Rewritten, this means:

$$\max_{\alpha \ge 0} \, \theta(\alpha, \beta) = \max_{\alpha \ge 0} \, \min_{w} \, L(w, \alpha, \beta)$$
This strategy is a standard technique in the theory of Support Vector Machines. As seen later, the dual representation allows us to work in high-dimensional spaces using so-called kernels (explained in chapter 6) without "falling prey to the curse of dimensionality". The Kuhn-Tucker complementarity conditions, introduced in the following subchapter, lead to a significant reduction of the data involved in the training process: they imply that only the active constraints have non-zero dual variables and are therefore necessary to determine the sought hyperplane. This observation will later lead to the term support vectors, as seen in chapter 5.
4.4 KUHN-TUCKER THEORY
Theorem 4.11 (Kuhn-Tucker)
Given an optimization problem with convex domain $\Omega \subseteq \mathbb{R}^n$,

Minimize $f(w)$, $w \in \Omega$
subject to $g_i(w) \le 0$, $i = 1 \ldots k$
and $h_j(w) = 0$, $j = 1 \ldots m$

with $f \in C^1$ convex and $g_i$, $h_j$ affine, necessary and sufficient conditions for a point $w^*$ to be an optimum are the existence of $\alpha^*$, $\beta^*$ such that

$$\frac{\partial L(w^*, \alpha^*, \beta^*)}{\partial w} = 0, \qquad \frac{\partial L(w^*, \alpha^*, \beta^*)}{\partial \beta} = 0,$$

$$\alpha_i^* \, g_i(w^*) = 0, \qquad g_i(w^*) \le 0, \qquad \alpha_i^* \ge 0, \qquad i = 1 \ldots k$$
The third relation is also known as the KT complementarity condition. It implies that active constraints can have multipliers $\alpha_i^* \ge 0$, whereas inactive ones have $\alpha_i^* = 0$. As an interpretation of the complementarity condition, one can say that a solution point can be in one of two positions with respect to an inequality constraint: either in the interior of the feasible region, with the constraint inactive, or on the boundary defined by that constraint, with the constraint active. So the KT conditions say that either a constraint is active, meaning $g_i(w^*) = 0$ and $\alpha_i^* \ge 0$, or the corresponding multiplier $\alpha_i^* = 0$.
The KT conditions thus give a hint on what the solution looks like and how the Lagrange multipliers behave: a point is an optimal solution if and only if the KT conditions are fulfilled.
Summarizing this chapter, the theorems and definitions above give useful techniques for solving convex optimization problems with inequality and equality constraints acting at the same time. The goal of the techniques is to "simplify" the given primal problem by formulating the dual one, in which the constraints are mostly equalities, which are easier to handle. The KT conditions describe the optimal solution and its important behaviour and will be the stopping criterion for the numerical solutions implemented later.
In the later chapters about the implementation of algorithms solving such optimization problems, we will see that the main problem is the size of the training set, which determines the size of the kernel matrix. Using standard solution techniques, the kernel matrix will quickly exceed hundreds of megabytes of memory, even when the sample size is just a few thousand points (which is not much in real-world applications).
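The memory consumption is simple arithmetic; a minimal Matlab sketch with an assumed sample size:

% Memory for a dense kernel matrix of l points in double precision (8 bytes/entry)
l = 5000;
megabytes = l^2 * 8 / 2^20    % about 190 MB for only 5000 training points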
Part II
Support Vector Machines
Chapter 5
Linear Classification
5.1 Linear Classifiers on Linearly Separable Data
As a first step in understanding and constructing Support Vector Machines, we study the case of linearly separable data which is classified into just two classes, the positive and the negative one, also known as binary classification. As a link to an example important nowadays, imagine the problem of classifying email into spam or not-spam. (A calculated example and examples on linearly (non-)separable data can be found in appendix B.2.)
This is frequently performed by using a real-valued function $f : X \subseteq \mathbb{R}^n \to \mathbb{R}$ in the following way: the input $x = (x_1, \ldots, x_n)'$ is assigned to the positive class if $f(x) \ge 0$, and otherwise to the negative one.
The vector $x$ is built up from the relevant features which are used for classification.
In our spam example above we need to extract relevant features (certain words) from the text and build a feature vector for the corresponding document. Often such feature vectors consist of the occurrence counts of predefined words, as in figure 5.1. If you would like to learn more about text classification/categorization, have a look at [Joa98], where the feature vectors have dimensions in the range of about 9000.
In this diploma thesis we assume that the features are already available.
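Nonetheless, a minimal Matlab sketch of such a word-count feature vector for the sentence of figure 5.1, over a small hypothetical dictionary (real systems such as the one in [Joa98] use thousands of entries):

% Word-count feature vector over a hypothetical dictionary
dictionary = {'viagra', 'casino', 'video', 'learning'};
sentence = lower(['Take Viagra before watching a video or leave ' ...
                  'Viagra be to play in our online casino.']);
words = regexp(sentence, '[a-z]+', 'match');   % crude tokenizer
x = zeros(numel(dictionary), 1);
for k = 1:numel(dictionary)
    x(k) = sum(strcmp(words, dictionary{k}));  % occurrences of each entry
end
% x = [2; 1; 1; 0] is then the input vector for the classifier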
We consider the case where $f(x)$ is a linear function of $x \in X$, so it can be written as

$$f(x) = \langle w, x \rangle + b = \sum_{i} w_i x_i + b \qquad (5.1)$$

where $(w, b) \in \mathbb{R}^n \times \mathbb{R}$ are the parameters.
Figure 5.1: Vector representation of the sentence “Take Viagra before watching a video
or leave Viagra be to play in our online casino.”
These are often referred to as weight vector w and bias b, terms borrowed
from the neural network literature.
As stated in Part I, the goal is to learn these parameters from the given
and already classified data (done by the supervisor/teacher), the training
set. So this way of learning is called supervised learning.
So the decision function for the classification of an input $x = (x_1, \ldots, x_n)'$ is given by $\operatorname{sgn}(f(x))$:

$$\operatorname{sgn}(f(x)) = \begin{cases} +1, & \text{if } f(x) \ge 0 \text{ (positive class)} \\ -1, & \text{else (negative class)} \end{cases}$$
Geometrically we can interpret this behaviour as follows (see figure 5.2): the input space $X$ is split into two parts by the so-called hyperplane defined by the equation $\langle w, x \rangle + b = 0$. Every input vector satisfying this equation lies directly on the hyperplane. A hyperplane is an affine subspace of dimension $n-1$ which divides the space into two half-spaces corresponding to the inputs of the two distinct classes. (A translation of a linear subspace of $\mathbb{R}^n$ is called an affine subspace; for example, any line or plane in $\mathbb{R}^3$ is an affine subspace.)
In the example of figure 5.2, $n$ is 2, a two-dimensional input space, so the hyperplane is simply a line. The vector $w$ defines a direction perpendicular to the hyperplane, so the orientation of the plane is unique, while varying the value of $b$ moves the hyperplane parallel to itself; negative values of $b$ move the hyperplane, which runs through the origin for $b = 0$, in the "positive direction" of $w$.
Figure 5.2: A separating hyperplane (w, b) for a two-dimensional training set. The smaller dotted lines represent the class of hyperplanes with the same w and different values of b.
In fact, it is clear that to represent all possible hyperplanes in the space $\mathbb{R}^n$, $n + 1$ free parameters are needed: $n$ given by $w$ and one by $b$.
But the question arising here is which hyperplane to choose, because there are many possible ways in which it can separate the data. So we need a criterion for choosing 'the best one', the 'optimal' separating hyperplane.
The goal behind supervised learning from examples for classification can be restricted to the consideration of the two-class problem without loss of generality. In this problem the goal is to separate the two classes by a function which is induced from available examples. The overall goal is to produce a classifier (by finding parameters $w$ and $b$) that will work well on unseen examples, i.e. that generalizes well.
If the distance between the separating hyperplane and the training points becomes too small, even test examples near to the given training points would be misclassified; figure 5.3 illustrates this behaviour. Therefore the classification of unseen data promises to be much more successful in setting B than in setting A.
This observation leads to the concept of maximal margin hyperplanes, or the optimal separating hyperplane.
Figure 5.3: Which separation to choose? Almost zero margin (A) or large margin (B)?
In appendix B.2 we take a closer look at a 'simple' iterative algorithm separating points of two classes by means of a hyperplane, the so-called Perceptron. It is only applicable to linearly separable data. There we also find some important issues, also stressed in the following chapters, which will have a large impact on the algorithm(s) used in Support Vector Machines.
5.2 The Optimal Separating Hyperplane for Linearly Separable Data
Definition 5.1 (Margin)
Consider the separating hyperplane $H$ defined by $\langle w, x \rangle + b = 0$, with both $w$ and $b$ normalised: $\hat{w} = \frac{w}{\|w\|}$ and $\hat{b} = \frac{b}{\|w\|}$.

- The (functional) margin $\gamma_i(w, b)$ of an example $(x_i, y_i)$ with respect to $H$ is defined as the distance between $x_i$ and $H$:

$$\gamma_i(w, b) = y_i \, (\langle \hat{w}, x_i \rangle + \hat{b})$$

- The margin $\gamma_S(w, b)$ of a set of vectors $A = \{x_1, \ldots, x_n\}$ is defined as the minimum distance from $H$ to the vectors in $A$:

$$\gamma_S(w, b) = \min_{x_i \in A} \gamma_i(w, b)$$
For clarification see figures 5.4 and 5.5.
Figure 5.4: The (functional) margin of two points with respect to a hyperplane
In figure 5.5 we have introduced two new identifiers, d+ and d-: let them be the shortest distances from the separating hyperplane H to the closest positive and negative example, respectively (the smallest functional margin from each class). The geometric margin is then defined as d+ + d-.
Figure 5.5: The (geometric) margin of a training set
The training set is said to be optimally separated by the hyperplane if it is separated without any error and the distance between the closest vectors and the hyperplane is maximal (maximal margin) [Vap98]. So the goal is to maximize the margin.
As Vapnik showed in his work [Vap98], we can assume canonical hyperplanes in the upcoming discussion without loss of generality. This is necessary because the following problem exists: for any scaling parameter $c \ne 0$,

$$\langle w, x \rangle + b = 0 \quad \leftrightarrow \quad \langle cw, x \rangle + cb = 0$$

E.g.

$$\left\langle \begin{pmatrix} 2 \\ 2 \end{pmatrix}, x \right\rangle + 2 = 0 \quad \leftrightarrow \quad 2 x_1 + 2 x_2 + 2 = 0$$

A possible solution is $x = \begin{pmatrix} -1 \\ 0 \end{pmatrix}$. With a parameter $c$ of value 5 we get

$$\left\langle \begin{pmatrix} 10 \\ 10 \end{pmatrix}, x \right\rangle + 10 = 0 \quad \leftrightarrow \quad 10 x_1 + 10 x_2 + 10 = 0$$

which can also be solved by $x = \begin{pmatrix} -1 \\ 0 \end{pmatrix}$.
So $(cw, cb)$ describe the same hyperplane as $(w, b)$ do. This means the hyperplane is not described uniquely!
For uniqueness, $(w, b)$ always needs to be scaled by a factor $c$ relative to the training set. The following constraint is chosen to do this:

$$\min_i | \langle w, x_i \rangle + b | = 1$$

This constraint scales the hyperplane in such a way that the training points nearest to it get an important property: they now solve $\langle w, x_i \rangle + b = +1$ for $x_i$ of class $y_i = +1$ and $\langle w, x_i \rangle + b = -1$ for $x_i$ of class $y_i = -1$.
A hyperplane scaled in this way is called a canonical hyperplane.
Reformulated, this means (implying correct classification):

$$y_i \, (\langle w, x_i \rangle + b) \ge 1 \; ; \quad i = 1 \ldots l \qquad (5.2)$$
This can be transformed into the following constraints:

$$\langle w, x_i \rangle + b \ge +1 \quad \text{for } y_i = +1$$
$$\langle w, x_i \rangle + b \le -1 \quad \text{for } y_i = -1 \qquad (5.3)$$

Therefore it is clear that the hyperplanes $H_1$ and $H_2$ in figure 5.5 solve $\langle w, x \rangle + b = +1$ and $\langle w, x \rangle + b = -1$. They are called margin hyperplanes.
Note that $H_1$ and $H_2$ are parallel (they have the same normal $w$, as $H$ does too) and that no other training points fall between them in the margin! They solve $\min_i | \langle w, x_i \rangle + b | = 1$.
Definition 5.2 (Distance)
The Euclidean distance $d(w, b; x_i)$ of a point $x_i$ belonging to a class $y_i \in \{-1, +1\}$ from the hyperplane $(w, b)$ defined by $\langle w, x \rangle + b = 0$ is

$$d(w, b; x_i) = \frac{ | \langle w, x_i \rangle + b | }{ \| w \| } \qquad (5.4)$$

As stated above, training points $(x_1, +1)$ and $(x_2, -1)$ that are nearest to the so scaled hyperplane (they lie on $H_1$ and $H_2$, respectively) have the distances $d_+$ and $d_-$ from it (see figure 5.5).
Reformulated with equation (5.4) and constraints (5.3), this means:

$$\langle w, x_1 \rangle + b = +1 \;\Rightarrow\; \left\langle \frac{w}{\|w\|}, x_1 \right\rangle + \frac{b}{\|w\|} = \frac{1}{\|w\|} \;\overset{(5.4)}{\Rightarrow}\; d_+ = \frac{1}{\|w\|}$$

and

$$\langle w, x_2 \rangle + b = -1 \;\Rightarrow\; \left\langle \frac{w}{\|w\|}, x_2 \right\rangle + \frac{b}{\|w\|} = \frac{-1}{\|w\|} \;\Rightarrow\; d_- = \frac{1}{\|w\|}$$
So overall, as seen in figure 5.5, the geometric margin of a separating canonical hyperplane is $d_+ + d_- = \frac{2}{\|w\|}$.
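A minimal Matlab sketch computing these quantities for a hand-made canonical hyperplane (the data is chosen such that $\min_i |\langle w, x_i \rangle + b| = 1$ holds):

% Functional margins and the geometric margin 2/||w|| of a canonical hyperplane
w = [1; 1]; b = -3;                 % the hyperplane x1 + x2 - 3 = 0
X = [1 1; 2 0; 3 1; 2 3];           % training points, one per row
y = [-1; -1; 1; 1];
fmargins = y .* (X*w + b);          % y_i(<w,x_i> + b), all >= 1 here
assert(min(abs(X*w + b)) == 1);     % canonical scaling holds
geom_margin = 2 / norm(w)           % = sqrt(2) for this example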
As stated, the goal is to maximize this margin, which is achieved by minimising $\|w\|$. The transformation to a quadratic function of the form $\Phi(w) = \frac{1}{2} \|w\|^2$ does not change the result but will ease the later calculation.
This is because we now solve the problem with help of the Lagrangian
method. There are two reasons for doing so. First the constraints of (5.2)
will be replaced by constraints on the Lagrangian themselves, which will
( w ) 
43
be much easier to handle (they are equalities then). Second the training
data will only appear in the form of dot products between vectors, which
will be a crucial concept later in generalizing the method on the nonlinear
separable case and the use of kernels.
And so the problem is reformulated as a convex one, which is overall easier to handle by the Lagrangian method with its differentiations.
Summarizing, we have the following optimization problem to solve:

Given a linearly separable training set S = ((x₁, y₁), …, (x_l, y_l)),

Minimize   ½ ||w||²
subject to  w · x_i + b ≥ +1  for y_i = +1
            w · x_i + b ≤ −1  for y_i = −1    (5.5)

The constraints are necessary to ensure uniqueness of the hyperplane, as mentioned above!
Note:

min ½ ||w||²  ⟺  min ½ (w · w) , because

||w|| = √(w₁² + w₂² + … + w_n²)
||w||² = w₁² + w₂² + … + w_n²
w · w = w₁w₁ + … + w_n w_n = w₁² + w₂² + … + w_n²
Also, the optimization problem is independent of the bias b: provided equation (5.2) is satisfied, i.e. it is a separating hyperplane, changing the value of b only moves the hyperplane in its normal direction. Accordingly the margin remains unchanged, but the hyperplane would no longer be optimal.
The problem of (5.5) is known as a convex quadratic optimization⁸ problem with linear constraints, and can be efficiently solved by using the method of the Lagrange Multipliers and the duality theory (see chapter 4).

⁸ Convexity will be proved in chapter 5.5.2
The primal Lagrangian for (5.5) and the given linearly separable training set S = ((x₁, y₁), …, (x_l, y_l)) is

L_P(w, b, α) = ½ (w · w) − Σ_{i=1}^{l} α_i [y_i (w · x_i + b) − 1]    (5.6)
where α_i ≥ 0 are the Lagrange Multipliers. This Lagrangian L_P has to be minimized with respect to the primal variables w and b.
As seen in chapter 4, at the saddle point the two derivatives with respect to w and b must vanish (stationarity),

∂/∂w L_P(w, b, α) = w − Σ_{i=1}^{l} y_i α_i x_i = 0
∂/∂b L_P(w, b, α) = Σ_{i=1}^{l} y_i α_i = 0

obtaining the following relations:

w = Σ_{i=1}^{l} y_i α_i x_i
0 = Σ_{i=1}^{l} y_i α_i    (5.7)
By substituting the relations (5.7) back into L_P one arrives at the so called Wolfe Dual of the optimization problem (now only dependent on α, no more w and b!):

L_D(w, b, α) = ½ (w · w) − Σ_{i=1}^{l} α_i [y_i (w · x_i + b) − 1]
            = ½ Σ_{i,j=1}^{l} y_i y_j α_i α_j (x_i · x_j) − Σ_{i,j=1}^{l} y_i y_j α_i α_j (x_i · x_j) + Σ_{i=1}^{l} α_i
            = Σ_{i=1}^{l} α_i − ½ Σ_{i,j=1}^{l} α_i α_j y_i y_j (x_i · x_j) = W(α)    (5.8)
So the dual problem for (5.6) can be formulated:

Given a linearly separable training set S = ((x₁, y₁), …, (x_l, y_l)),

Maximize   W(α) = Σ_{i=1}^{l} α_i − ½ Σ_{i,j=1}^{l} α_i α_j y_i y_j (x_i · x_j)    (5.9)
subject to  α_i ≥ 0 ;  i = 1 … l
            Σ_{i=1}^{l} y_i α_i = 0

Note: The matrix (x_i · x_j) is known as the Gram Matrix G.
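To make this concrete, the following minimal sketch (an illustration only, not the implementation of this thesis; chapter 9 discusses the SMO algorithm actually used) builds the Gram matrix and hands the dual (5.9) to a general-purpose solver. The helper name solve_hard_margin_dual and the use of scipy are assumptions:

import numpy as np
from scipy.optimize import minimize

def solve_hard_margin_dual(X, y):
    # X: (l, n) matrix of training points, y: (l,) labels in {-1, +1}
    l = X.shape[0]
    Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j (x_i . x_j)
    # minimize the negative of W(alpha) = sum(alpha) - 1/2 alpha' Q alpha
    obj = lambda a: 0.5 * a @ Q @ a - a.sum()
    grad = lambda a: Q @ a - 1.0
    cons = {"type": "eq", "fun": lambda a: a @ y}   # sum_i y_i alpha_i = 0
    res = minimize(obj, np.zeros(l), jac=grad, method="SLSQP",
                   bounds=[(0.0, None)] * l,        # alpha_i >= 0
                   constraints=[cons])
    return res.x                                    # the optimal alpha*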
So the goal is to find parameters α* which solve this optimization problem. As a solution to construct the optimal separating hyperplane with maximal margin we obtain the optimal weight vector:

w* = Σ_{i=1}^{l} α_i* y_i x_i    (5.10)
Remark: One might think that the problem can now be solved easily, as the one in appendix C, with the use of Lagrangian theory and the primal (dual) objective function. This may be true for input vectors of small dimension, e.g. 2. But in the real-world case the number of variables can be several thousand. There, solving the system with standard techniques will not be practicable in terms of time and memory usage of the corresponding vectors and matrices. This issue will be discussed in the implementation chapter later.
5.2.1 Support Vectors
Stating the Kuhn-Tucker (KT) conditions for the primal problem L_P above (5.6), as seen in chapter 4, we get

∂/∂w L_P(w*, b*, α*) = w* − Σ_i α_i* y_i x_i = 0
∂/∂b L_P(w*, b*, α*) = Σ_i α_i* y_i = 0
y_i (w* · x_i + b*) − 1 ≥ 0
α_i ≥ 0
α_i* [y_i (w* · x_i + b*) − 1] = 0    (5.11)
As mentioned, the optimization problem for SVMs is a convex one (a convex function, with constraints giving a convex feasible region). And for convex problems the KT conditions are necessary and sufficient for w*, b* and α* to be a solution. Thus solving the primal/dual problem of the SVMs is equivalent to finding a solution to the KT conditions (for the primal)⁹ (see chapter 4, too).
The fifth relation in (5.11) is known as the KT complementary condition. In the chapter on optimization theory an intuition was given of how it works. In the SVM's problem it has a nice graphical meaning. It states that for a given training point x_i either the corresponding Lagrange Multiplier α_i equals zero or, if not zero, x_i lies on one of the margin hyperplanes (see figure 5.4 and following text) H1 or H2:

H1 : w · x_i + b = +1
H2 : w · x_i + b = −1

On them are the training points x_i with minimal distance to the optimal separating hyperplane OSH (with maximal margin). The vectors lying on H1 or H2 with α_i > 0 are called Support Vectors (SV).
Definition 5.3 (Support Vectors)
A training point x_i is called support vector if its corresponding Lagrange multiplier α_i > 0.

All other training points have α_i = 0 and either lie on one of the two margin hyperplanes (equality in (5.2)) or on the outer side of H1 or H2 (inequality in (5.2)).
A training point with α_i = 0 can still be on one of the two margin hyperplanes, because the complementary condition in (5.11) only states that all SVs are on the margin hyperplanes, not that the SVs are the only points on them. So there may be the case where both α_i = 0 and y_i (w · x_i + b) − 1 = 0. Then the point x_i lies on one of the two margin hyperplanes without being a SV.
Therefore SVs are the only points involved in determining the optimal
weight vector in equation (5.10).
So the crucial concept here is that the optimal separating hyperplane is uniquely defined by the SVs of a training set. That means repeating the training with all other points removed, or moved around without crossing H1 or H2, leads to the same weight vector and therefore to the same optimal separating hyperplane.
⁹ Only the KT conditions of the primal will be needed, because the primal and dual problems are equivalent: we will maximize the dual (it depends only on α!) and take the KT conditions of the primal as a criterion.
In other words, a compression has taken place. So for repeating the training later, the same result can be achieved by only using the determined
SVs.
Figure 5.6: The optimal separating hyperplane (OSH) with maximal margin is determined by the support vectors (SV, marked) lying on the margin hyperplanes H1 and H2.
Note that in the dual representation the value of b does not appear, and so the optimal value b* has to be found making use of the primal constraints:

y_i (w · x_i + b) − 1 ≥ 0 ;  i = 1 … l

So only the optimal value of w is explicitly determined by the training procedure, which implies we have optimal values for α. Therefore it is possible to pick any α_i > 0, i.e. a support vector, and with the substitution w = Σ_{i=1}^{l} α_i y_i x_i the above inequality becomes an equality (= 0, because a support vector is always part of a margin hyperplane), from which b can be computed.
Numerically it is safer to compute b for all support vectors i and take the mean value, or to use another approach as in the book [Nel00]:

b* = − [ max_{y_i = −1} (w* · x_i) + min_{y_i = +1} (w* · x_i) ] / 2    (5.12)
Note: This approach to compute the bias has been shown to be problematic with regard to the implementation of the SMO algorithm, as shown by [Ker01]. This issue will be discussed in the implementation chapter later.
5.2.2 Classification of unseen data
After the hyperplane's parameters (w* and b*) have been learned from the training set, we can classify unseen/unlabelled data points z. In the binary case (2 classes), discussed up to now, the found hyperplane divides the ℝⁿ into two regions: one where w* · x + b* > 0 and the other one where w* · x + b* < 0. The idea behind the maximal margin classifier is to determine on which of the two sides the test pattern lies and to assign the label correspondingly with −1 or +1 (as all classifiers do), and also to maximize the margin between the two sets.
Hence the used decision function can be expressed with the optimal parameters w* and b*, and therefore by the found support vectors x_i^SV, their corresponding α_i* > 0 and b*. So overall the decision function of the trained maximal margin classifier for some data point z can be formulated as:

f(z, α*, b*) = sgn(w* · z + b*)
            = sgn( Σ_{i=1}^{l} y_i α_i* (x_i · z) + b* )    (5.13)
            = sgn( Σ_{i ∈ SV} y_i α_i* (x_i · z) + b* )

whereby the last reformulation only sums over the elements (training point x_i, corresponding label y_i, associated α_i) which belong to a support vector (SV), because only they have α_i > 0 and therefore an impact on the sum.
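As a small illustration, the following sketch (names and the tolerance are assumptions, continuing the hypothetical solver sketched above) recovers w* via (5.10), computes b* as the numerically safer mean over all support vectors, and evaluates (5.13):

import numpy as np

def maximal_margin_classifier(alpha, X, y, tol=1e-8):
    sv = alpha > tol                      # support vectors have alpha_i > 0
    w = (alpha[sv] * y[sv]) @ X[sv]       # w* = sum_i alpha_i* y_i x_i   (5.10)
    b = np.mean(y[sv] - X[sv] @ w)        # from y_i (w*.x_i + b*) = 1 on the margin
    return lambda z: np.sign(z @ w + b)   # f(z) = sgn(w*.z + b*)         (5.13)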
All in all, the optimal separating hyperplane we get by solving the margin optimization problem is a very simple special case of a Support Vector Machine, because it computes directly on the input data. But it is a good starting point for understanding the forthcoming concepts. In the next chapters the concept will be generalized to nonlinear classifiers, and therefore the concept of kernel mapping will be introduced. But first the adaptation of the separating hyperplane to linearly non-separable data will be done.
5.3 The Optimal Separating Hyperplane for Linear Non-Separable Data
The algorithm above for the maximal margin classifier cannot be used in many real-world applications. In general, noisy data will render linear separation impossible, but the biggest problem in practice remains the used features, which lead to overlapping classes. The main problem with the maximal margin classifier is the fact that it allows no classification errors during training: either the training is perfect without any errors or there is no solution at all. Hence it is intuitive that we need a way to relax the constraints of (5.3). But each violation of the constraints needs to be “punished” by a misclassification penalty, i.e. an increase in the primal objective function L_P.
This can be realized by introducing the so called positive slack variables ξ_i (i = 1 … l) in the constraints first and, as shown later, introducing an error weight C, too:

w · x_i + b ≥ +1 − ξ_i   for y_i = +1
w · x_i + b ≤ −1 + ξ_i   for y_i = −1
ξ_i ≥ 0

As above, these two constraints can be rewritten into one:

y_i (w · x_i + b) − 1 + ξ_i ≥ 0 ;  i = 1 … l    (5.14)
So the ξ_i's can be interpreted as a value that measures how much a point fails to have a margin (distance to the OSH) of 1 / ||w||, and it indicates where a point x_i lies compared to the separating hyperplane (see figure 5.7):

ξ_i > 1  ⟹  y_i (w · x_i + b) < 0  ⟹  misclassification
0 < ξ_i ≤ 1  ⟹  x_i is classified correctly, but lies inside the margin
ξ_i = 0  ⟹  x_i is classified correctly and lies outside the margin or on the margin boundary

So a classification error is marked by the corresponding ξ_i exceeding unity. Therefore Σ_{i=1}^{l} ξ_i is an upper bound on the number of training errors.
Overall, with the introduction of these slack variables the goal is to maximize the margin and simultaneously minimize misclassifications. To define a penalty on training errors, the error weight C is introduced via the term

C Σ_{i=1}^{l} ξ_i

This parameter has to be chosen by the user. In practice, C is varied through a wide range of values and the optimal performance is assessed using a separate validation set, or using a technique called cross-validation to verify performance on the training set alone.
Figure 5.7: Values of slack variables: (1) misclassification if ξ_i is larger than the margin (ξ_i > 1); (2) correct classification of x_i lying in the margin with 0 < ξ_i ≤ 1; (3) correct classification of x_i outside the margin (or on it) with ξ_i = 0
So the optimization problem can be extended to:

Minimize   ½ ||w||² + C Σ_{i=1}^{l} ξ_i^k
subject to  y_i (w · x_i + b) − 1 + ξ_i ≥ 0 ;  i = 1 … l    (5.15)
            ξ_i ≥ 0

The problem is again a convex one for any positive integer k. This approach is called the Soft Margin generalization, while the original concept above is known as Hard Margin, because it allows no errors. The soft margin case is widely used with the values k = 1 (1-Norm Soft Margin) and k = 2 (2-Norm Soft Margin).
5.3.1 1-Norm Soft Margin - or the Box Constraint -
For k = 1, as above, the primal Lagrangian can be formulated as

L_P(w, b, ξ, α, β) = ½ (w · w) + C Σ_{i=1}^{l} ξ_i − Σ_{i=1}^{l} α_i [y_i (w · x_i + b) − 1 + ξ_i] − Σ_{i=1}^{l} β_i ξ_i

with α_i ≥ 0 and β_i ≥ 0.

Note: As described in chapter 4, we need another multiplier β here because of the new inequality constraint ξ_i ≥ 0.
As before, the corresponding dual representation is found by differentiating L_P with respect to w, ξ and b:

∂/∂w L_P = w − Σ_{i=1}^{l} α_i y_i x_i = 0
∂/∂ξ_i L_P = C − α_i − β_i = 0
∂/∂b L_P = Σ_{i=1}^{l} α_i y_i = 0

By resubstituting these relations back into the primal we obtain the dual formulation L_D:
Given a training set S = ((x₁, y₁), …, (x_l, y_l)),

Maximize   L_D(w, b, ξ, α, β) = W(α) = Σ_{i=1}^{l} α_i − ½ Σ_{i,j=1}^{l} α_i α_j y_i y_j (x_i · x_j)    (5.16)
subject to  Σ_{i=1}^{l} y_i α_i = 0
            0 ≤ α_i ≤ C ;  i = 1 … l
This problem is curiously identical to the maximal (hard) margin one in (5.9). The only difference is that C − α_i − β_i = 0 together with β_i ≥ 0 enforces α_i ≤ C. So in the soft margin case the Lagrange multipliers are upper bounded by C. The Kuhn-Tucker complementary conditions for the primal above are:

α_i [y_i (w · x_i + b) − 1 + ξ_i] = 0 ;  i = 1 … l
ξ_i (α_i − C) = 0 ;  i = 1 … l
Another consequence of the KT conditions is that non-zero slack variables ξ_i can only occur when β_i = 0 and therefore α_i = C. The corresponding point x_i then has a distance less than 1 / ||w|| from the hyperplane and therefore lies inside the margin.
This can be seen with the constraints (only shown for y_i = +1, the other case is analogous):

w · x_i + b = +1  ⟹  (w / ||w||) · x_i + b / ||w|| = 1 / ||w||

for points on the margin hyperplane, and

w · x_i + b = +1 − ξ_i  ⟹  (w / ||w||) · x_i + b / ||w|| = 1 / ||w|| − ξ_i / ||w||

And therefore points x_i with non-zero slack variables have a distance less than 1 / ||w||. Points for which 0 < α_i < C then lie exactly at the target distance of 1 / ||w|| and therefore on one of the margin hyperplanes (ξ_i = 0). This also shows that the hard margin hyperplane can be attained in the soft margin case by setting C to infinity (∞).
The fact that the Lagrange multipliers α_i are upper bounded by the value of C gives this technique its name: box constraint. The vector α is constrained to lie inside the box with side length C in the positive orthant (0 ≤ α_i). This approach is also known as the SVM with linear loss function.
5.3.2 2-Norm Soft Margin - or Weighting the Diagonal -
This is the case for k = 2. But before stating the primal Lagrangian, and for ease of the upcoming calculation, note that if ξ_i < 0, the first constraint of (5.15) still holds with ξ_i = 0. Hence we still obtain the optimal solution when the positivity constraint on ξ_i is removed. So this leads to the following primal Lagrangian:

L_P(w, b, ξ, α) = ½ (w · w) + (C/2) Σ_{i=1}^{l} ξ_i² − Σ_{i=1}^{l} α_i [y_i (w · x_i + b) − 1 + ξ_i]
              = ½ (w · w) + (C/2) Σ_{i=1}^{l} ξ_i² − Σ_{i=1}^{l} α_i y_i (w · x_i) + Σ_{i=1}^{l} α_i − Σ_{i=1}^{l} α_i ξ_i − b Σ_{i=1}^{l} α_i y_i

with α_i ≥ 0 the Lagrange multipliers again.
As before, the corresponding dual is found by differentiating with respect to w, ξ and b, imposing stationarity (i.e. setting to zero):

∂/∂w L_P = w − Σ_{i=1}^{l} α_i y_i x_i = 0
∂/∂ξ L_P = Cξ − α = 0
∂/∂b L_P = Σ_{i=1}^{l} α_i y_i = 0
and again resubstituting the relations back into the primal to obtain the dual formulation L_D:

L_D(w, b, ξ, α) = Σ_{i=1}^{l} α_i − ½ Σ_{i,j=1}^{l} α_i α_j y_i y_j (x_i · x_j) + (1/2C) (α · α) − (1/C) (α · α)
              = Σ_{i=1}^{l} α_i − ½ Σ_{i,j=1}^{l} α_i α_j y_i y_j (x_i · x_j) − (1/2C) (α · α)
l
Using the equation α  α    i2 
i 1
l
 i  j  ij 
i , j 1
l
 
i , j 1
i
j
y i y j  ij
where  ij is the Kronecker Delta, which is defined to be 1 if i = j and 0 otherwise. So on the right side of above equation inserting y i y j changes
nothing at the result because y i is either +1 or -1 and y i y j  ij is the same
as writing y i2 , and so we simply multiply extra by 1, but can simplify L D to
get the final problem to be solved:
Given a training set S = ((x₁, y₁), …, (x_l, y_l)),

Maximize   L_D(w, b, ξ, α) = W(α) = Σ_{i=1}^{l} α_i − ½ Σ_{i,j=1}^{l} α_i α_j y_i y_j ( x_i · x_j + (1/C) δ_ij )    (5.17)
subject to  Σ_{i=1}^{l} y_i α_i = 0
            0 ≤ α_i ;  i = 1 … l
The complementary KT conditions for the primal problem above are

α_i [y_i (w · x_i + b) − 1 + ξ_i] = 0 ;  i = 1 … l

This whole problem can be solved with the same methods used for the maximal margin classifier. The only difference is the addition of 1/C to the diagonal of the Gram matrix G = (x_i · x_j); only on the diagonal, because of the Kronecker Delta. This approach is also known as the SVM with quadratic loss function.
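In code this again stays a tiny modification of the earlier hypothetical sketches: build the Gram matrix with the weighted diagonal and then solve the dual with positivity constraints only, for example:

import numpy as np

def two_norm_Q(X, y, C=1.0):
    # Gram matrix with 1/C added on the diagonal (the Kronecker delta term of 5.17)
    G = X @ X.T + np.eye(X.shape[0]) / C
    return (y[:, None] * y[None, :]) * G   # then solve the dual with alpha_i >= 0 only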
Summarizing this subchapter, it can be said that the soft margin optimization is a compromise between little empirical risk and a maximal margin; for an example look at figure 5.8. The value of C can be interpreted as the trade-off between minimizing the training set error and maximizing the margin. So all in all, by using C as an upper bound on the Lagrange multipliers, the role of “outliers” is reduced by preventing a point from having too large a Lagrange multiplier.
Figure 5.8: Decision boundaries arising when using a Gaussian kernel with a fixed value of σ in the three different machines: (a) the maximal margin SVM, (b) the 1-norm soft margin SVM and (c) the 2-norm soft margin SVM. The data are an artificially created two-dimensional set, the blue dots being positive examples and the red ones negative examples.
5.4 The Duality of Linear Machines
This section is intended to stress a fact that was used and remarked on several times before: the linear machines introduced above can be formulated in a dual description. This reformulation will turn out to be crucial for the construction of the more powerful generalized Support Vector Machines below. But what does ‘duality of classifiers’ mean?
As seen in the former chapter, the normal vector w can be represented as a linear combination of the training points:

w = Σ_{i=1}^{l} α_i y_i x_i

with S = ((x₁, y₁), …, (x_l, y_l)) the given training set already classified by the supervisor. The α_i were introduced through the Lagrangian approach to find a solution to the margin maximization problem. They were called the dual variables of the problem and are therefore the fundamental unknowns. On the way to the solution we then obtained

W(α) = Σ_{i=1}^{l} α_i − ½ Σ_{i,j=1}^{l} α_i α_j y_i y_j (x_i · x_j)
and the reformulated decision function for unseen data z of (5.13):

f(z, α, b) = sgn(w · z + b) = sgn( Σ_{i=1}^{l} α_i y_i (x_i · z) + b )

The crucial observation here is that the training and test points never act through their individual attributes. These points only appear as entries in the Gram Matrix G = (x_i · x_j) in the training phase, and later in the test phase they only appear in an inner product x_i · z with the training points.
5.5 Vector/Matrix Representation of the Optimization Problem and Summary

5.5.1 Vector/Matrix Representation
To give a first impression of how the above problems can be solved using a computer, the problem(s) will be formulated in the equivalent notation with vectors and matrices. This notation is more practical and understandable, and is used in many implementations.
As described above, the convex quadratic optimization problem which arises for the hard (C = ∞), 1-norm (C < ∞) and 2-norm (change the Gram matrix (x_i · x_j) by adding 1/C to the diagonal) margin is the following:
Maximize   L_D(w, b, ξ, α, β) = W(α) = Σ_{i=1}^{l} α_i − ½ Σ_{i,j=1}^{l} α_i α_j y_i y_j (x_i · x_j)
subject to  Σ_{i=1}^{l} y_i α_i = 0
            0 ≤ α_i ≤ C ;  i = 1 … l
This problem can be expressed as:

Maximize   eᵀα − ½ αᵀQα    (5.18)
subject to  yᵀα = 0
            0 ≤ α_i ≤ C ;  i = 1 … l

where e is the vector of all ones, C > 0 the upper bound, and Q an l-by-l positive semidefinite¹⁰ matrix with Q_ij = y_i y_j (x_i · x_j).
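As a sketch (the function name is assumed for illustration), these quantities are straightforward to build from the training data:

import numpy as np

def qp_matrices(X, y):
    # e: vector of all ones; Q_ij = y_i y_j (x_i . x_j), built via an outer product
    e = np.ones(len(y))
    Q = np.outer(y, y) * (X @ X.T)
    return e, Q   # maximize e'alpha - 1/2 alpha' Q alpha  subject to  y'alpha = 0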
And with a correct training set S = ((x₁, y₁), …, (x_l, y_l)) of length l, (5.18) written out would look like:

(1 1 … 1) (α₁, α₂, …, α_l)ᵀ − ½ (α₁ α₂ … α_l) [ Q₁₁ Q₁₂ … Q₁l ; Q₂₁ … ; … ; Q_l1 … Q_ll ] (α₁, α₂, …, α_l)ᵀ

5.5.2 Summary
As seen in chapter 4, quadratic problems with a so called positive (semi-)definite matrix are convex functions. This allows the crucial concepts of solutions to convex functions to be applied (see chapter 4: convexity, KT).
¹⁰ Semidefinite: for each λ, λᵀQλ ≥ 0 (Q has non-negative eigenvalues). Also see below for an explanation.
In former chapters the convexity of the objective function has been assumed without proof.
So let M be any (possibly non-square) matrix and set A = MᵀM. Then A is a positive semi-definite matrix, since we can write

xᵀAx = xᵀMᵀMx = (Mx)ᵀ(Mx) = Mx · Mx = ||Mx||² ≥ 0    (5.19)

for any vector x. If we take M to be the matrix whose columns are the vectors x_i, i = 1 … l, then A is the Gram Matrix (x_i · x_j) of the set S = (x₁, …, x_l), showing that Gram Matrices are always positive semi-definite. And therefore the above matrix Q also is positive semi-definite.
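A quick numerical illustration of (5.19), with arbitrary made-up vectors:

import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 5))             # columns of M are five vectors x_i in R^3
A = M.T @ M                             # Gram matrix A_ij = x_i . x_j
print(np.linalg.eigvalsh(A) >= -1e-10)  # eigenvalues non-negative up to round-off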
Summarized, the problem to be solved up to now can be stated as:

Maximize   L_D(w, b, ξ, α, β) = W(α) = Σ_{i=1}^{l} α_i − ½ Σ_{i,j=1}^{l} α_i α_j y_i y_j (x_i · x_j)    (5.20)
subject to  Σ_{i=1}^{l} y_i α_i = 0
            0 ≤ α_i ≤ C ;  i = 1 … l
with the particularly simple primal KT conditions as criteria for a solution to the 1-norm optimization problem (0 ≤ α_i ≤ C):

α_i = 0  ⟹  y_i (w* · x_i + b*) ≥ 1
0 < α_i < C  ⟹  y_i (w* · x_i + b*) = 1    (5.21)
α_i = C  ⟹  y_i (w* · x_i + b*) ≤ 1
Notice that the slack variables ξ_i do not need to be computed for this case, because as seen in chapter 5.3.1 they will only be non-zero if α_i = C and β_i = 0. So recall the primal of this chapter, stated as

L_P(w, b, ξ, α, β) = ½ (w · w) + C Σ_{i=1}^{l} ξ_i − Σ_{i=1}^{l} α_i [y_i (w · x_i + b) − 1 + ξ_i] − Σ_{i=1}^{l} β_i ξ_i

Then set α_i = C and β_i = 0: the third sum is zero, and from the second sum we get a term Σ_{i=1}^{l} C ξ_i, which is equal to C Σ_{i=1}^{l} ξ_i and therefore cancels it, so no slack variable appears anymore.
For the maximal margin case the conditions will be (α_i ≥ 0):

α_i = 0  ⟹  y_i (w* · x_i + b*) ≥ 1
0 < α_i  ⟹  y_i (w* · x_i + b*) = 1    (5.22)
And last but not least, for the 2-norm case (α_i ≥ 0):

α_i = 0  ⟹  y_i (w* · x_i + b*) ≥ 1
0 < α_i  ⟹  y_i (w* · x_i + b*) = 1 − α_i / C    (5.23)

The last condition is obtained by implicitly defining ξ_i with the help of the primal KT condition ∂/∂ξ L_P = Cξ − α = 0 of chapter 5.3.2, and therefore ξ_i = α_i / C. And with the complementary KT condition α_i [y_i (w · x_i + b) − 1 + ξ_i] = 0 the condition above is gained.
As seen in the soft margin chapters, points for which the second equation holds are Support Vectors on one of the margin hyperplanes, and points for which the third one holds lie inside the margin and are therefore called “margin errors”. These KT conditions will be used later and prove to be important when implementing algorithms for numerically solving the problem of (5.20), because a point is an optimum of (5.20) if and only if the KT conditions are fulfilled and Q_ij = y_i y_j (x_i · x_j) is positive semi-definite. The second requirement is proven above.
And after the training process (the solving of the quadratic optimization problem and, as a solution, getting the vector α and therefore the bias b), the classification of unseen data z is performed by

f(z, α*, b*) = sgn(w* · z + b*)
            = sgn( Σ_{i=1}^{l} y_i α_i* (x_i · z) + b* )    (5.23)
            = sgn( Σ_{i ∈ SV} y_i α_i* (x_i · z) + b* )

where the x_i are the training points with their corresponding α_i greater than zero and upper bounded by C, and therefore support vectors.
A question arising here is why new data should always be classified by the use of the α_i, and why not simply save the resulting weight vector w. Up to now it would indeed be possible to do that, with no further need of storing the training points x_i and their labels y_i. But as seen above there will normally be very few support vectors, and only they and their corresponding x_i and y_i are necessary to reconstruct w. The main reason, however, will be given in chapter 6, where we will see that we must use the α_i and cannot simply store w.
To give a short link to the implementation issue discussed later, it can be said that in most cases the 1-norm is used, because in real-world applications one normally does not have noise-free, linearly separable data, and therefore the maximal margin approach will not lead to satisfactory results. But the main problem in practice is still the selection of the used feature data. The 2-norm is used in fewer cases, because it is not easy to integrate into the SMO algorithm, discussed in the implementation chapter.
Chapter 6
Nonlinear Classifiers
The last chapter showed how linear classifiers can easily be computed by means of standard optimization techniques. But linear learning machines are restricted because of their limited computational power, as highlighted in the 1960's by Minsky and Papert. Summarized, it can be stated that real-world applications require more expressive hypothesis spaces than linear functions. Or in other words, the target concept may be too complex to be expressed as a “simple” linear combination of the given attributes (that is what linear machines do); equivalently, the decision function is not a linear function of the data. This problem can be overcome by the use of the so called kernel technique. The general idea is to map the input data nonlinearly to a (nearly always) higher dimensional space and then separate it there by linear classifiers, which results in a nonlinear classifier in input space (see figure 6.1). Another solution to this problem has been proposed in neural network theory: multiple layers of thresholded linear functions, which led to the development of multi-layer neural networks.
Figure 6.1: Simpler classification task by a feature map (Φ). 2-dimensional input space on the left, 2-dimensional feature space on the right, where we are able to separate by a linear classifier, which leads to the nonlinear classifier in input space.
6.1 Explicit Mappings
Now the representation of training examples will be changed by mapping the data to a (possibly infinite dimensional) Hilbert space¹¹ F. Usually the space F will have a much higher dimension than the input space X. The mapping Φ : X → F is applied to each labelled example before training, and then the optimal separating hyperplane is constructed in the space F.

Φ : X → F ,  x = (x₁, …, x_n) ↦ Φ(x) = (φ₁(x), …, φ_N(x))    (6.1)
This is equivalent to mapping the whole input space X into F.
The components of Φ(x ) are called features, while the original quantities
are sometimes referred to as the attributes. F is called the feature space.
The task of choosing the most suitable representation of the data is known
as feature selection. This can be a very difficult task. There are different
approaches existing to feature selection. Frequently one seeks to identify
the smallest set of features that still conveys the essential information contained in the original attributes. This is known as dimensionality reduction,
x  ( x1... x n )  Φ(x)  ( 1 (x ),...  d (x )), d  n
(6.2)
and can be very beneficial, as both computational and generalization performance can degrade as the number of features grows, a phenomenon known as the curse of dimensionality. On the other hand, the larger the set of (probably redundant) features is, the more likely it is that the function to be learned can be represented using a standard learning machine.
Another approach to feature selection is the detection of irrelevant features and their elimination. As an example consider the gravitation law,
which only uses information about the masses and the position of two bodies. So an irrelevant feature would be the colour or the temperature of the
two bodies.
So as a last word on feature selection, it should be considered carefully as a part of the learning process. But it is also naturally a somewhat arbitrary step, which needs some prior knowledge on the underlying target function. Therefore recent research has been done on techniques for feature reduction. However, in the rest of this diploma thesis we do not talk about feature selection techniques, because as Cristianini and Shawe-Taylor proved in their book [Nel00], we can afford to use infinite dimensional feature spaces and avoid computational problems by means of the implicit mapping described in the next chapter. So the “curse of dimensionality” can be said to be irrelevant when implicitly mapping the data, also known as the Kernel Trick.

¹¹ A Hilbert space is a vector space with some more restrictions: a space H is separable if there exists a countable subset D ⊆ H such that every element of H is the limit of a sequence of elements of D; a Hilbert space is a complete separable inner product space. Finite dimensional vector spaces like ℝⁿ are Hilbert spaces. This space will be described in a little more detail further on in this chapter; for further reading see [Nel00].
Before illustrating the mapping with an example, first notice that the only way in which data appears in the training problem is in the form of dot products x_i · x_j. Now suppose this data is first mapped to some other (possibly infinite dimensional) space F, using the mapping of (6.1):

Φ : ℝⁿ → F

Then of course, as seen in (6.1) and (6.2), the training algorithm would only depend on the data through dot products in F, i.e. on functions of the form Φ(x_i) · Φ(x_j) (all other variables are scalars). Second, there is no vector that maps to w via Φ, but we can write w in the form w = Σ_{i=1}^{l} α_i y_i Φ(x_i), and the whole hypothesis (decision) function will be of the type

f(x) = sgn( Σ_{i=1}^{l} w_i φ_i(x) + b ) ,

or reformulated

f(x) = sgn( Σ_{i=1}^{l} α_i y_i Φ(x_i) · Φ(x) + b ) .

So a support vector machine is constructed which “lives” in the new higher dimensional space F, but all the considerations of the former chapters will still hold, since we are still doing a linear separation, just in a different space.
But now a simple example with an explicit mapping. Consider a given training set of points in ℝ with class labels +1 and −1: S = {(−1, +1), (0, −1), (+1, +1)}. Trivially these three points are not separable by a hyperplane, here a point¹², in ℝ (see figure 6.2). So first the data is nonlinearly mapped to the ℝ³ by applying

Φ : ℝ → ℝ³ ,  x ↦ (x², √2 x, 1)ᵀ

¹² The input dimension is 1, therefore the hyperplane is of dimension 1 − 1 = 0, i.e. a single point.
Figure 6.2: A non-separable example in the input space ℝ. The hyperplane would be a single point, but it cannot separate the data points.
This step results in a training set consisting of the vectors (1, −√2, 1)ᵀ, (0, 0, 1)ᵀ, (1, √2, 1)ᵀ with the corresponding labels (+1, −1, +1). As illustrated in figure 6.3, the solution in the new space ℝ³ can easily be seen geometrically in the Φ₁Φ₂-plane (see figure 6.4). It is w = (1, 0, 0)ᵀ, which is already normalized (it has a length of 1), and the bias becomes b = −0.5 (a negative b means moving the hyperplane running through the origin in the “positive direction”). So it can be seen that the learning task can easily be solved in the ℝ³ by linear separation. But how does the decision function look in the original space ℝ, where we need it?
l
Remember that w can be written in the form w    i y i Φ( x i ) .
i 1
Figure 6.3: Creation of a separating hyperplane, i.e. a plane, in the new space ℝ³.

Figure 6.4: Looking at the Φ₁Φ₂-plane, the solution for w and b can easily be given by geometric interpretation of the picture.
 1 3
 
And in our particular example it can be written as : w   0     i y i Φ(x i )
 0  i 1
 
And worked out:
 1
 1 
0
 1 
 1 
0
 1 
 


 
 


 
 
 0    1 * 1 *   2    2 * ( 1) *  0    3 * 1 *  2    1   2    2  0    3  2 
0
 1 
 1
 1 
 1 
  1
 1 
 


 
 


 
 
1 
1 2 
 
 
The solving vector α    2  is then α   1  .
 
1 2 
 3
 
With the equation

Φ(x₁) · Φ(x₂) = x₁² x₂² + 2 x₁ x₂ + 1 = (x₁ x₂ + 1)²    (6.3)
the hyperplane in ℝ then becomes, with x_i the original training points in ℝ:

Σ_{i=1}^{3} α_i y_i Φ(x_i) · Φ(z) + b = 0
⟹ ½ · 1 · (−z + 1)² + 1 · (−1) · (0 · z + 1)² + ½ · 1 · (z + 1)² − ½ = 0
⟹ ½ z² − z + ½ − 1 + ½ z² + z + ½ − ½ = 0
⟹ z² − ½ = 0

This leads to the nonlinear hyperplane in ℝ consisting of two points: z₁ = 1/√2 and z₂ = −1/√2.
As seen in equation (6.3), the inner product in the feature space has an equivalent function in the input space. Now we introduce an abbreviation for the dot product in feature space:

K(x, z) := Φ(x) · Φ(z)    (6.4)

Clearly, if the feature space is very high-dimensional, or even infinite dimensional, the right-hand side of (6.4) will be very expensive to compute. The observation in (6.3), together with the problem described above, motivates the search for ways to evaluate inner products in feature space without making direct use of the feature space or the mapping Φ. This approach leads to the terms Kernel and Kernel Trick.
6.2 Implicit Mappings and the Kernel Trick
Definition 6.1 (Kernel Function)
Given a mapping Φ : X → F from an input space X to an (inner product) feature space¹³ F, we call the function K : X × X → ℝ a kernel function if for all x, z ∈ X

K(x, z) = Φ(x) · Φ(z) .    (6.5)

The kernel function then behaves like an inner product in feature space, but can be evaluated as a function in input space.
For example take the polynomial kernel K(x, y) = (x · y)^d. Now assume we have d = 2 and x, y ∈ ℝ² (the original input space), so we get:

¹³ Inner product space: a vector space X is called an inner product space if there exists a bilinear map (linear in each argument) that for each two elements x, y ∈ X gives a real number, denoted by x · y, satisfying the following properties:
- x · y = y · x
- x · x ≥ 0, and x · x = 0 ⟺ x = 0
E.g.: X = ℝⁿ, x = (x₁ … x_n), y = (y₁ … y_n), λ_i fixed positive numbers. Then the following defines a valid inner product:

x · y = Σ_{i=1}^{n} λ_i x_i y_i = xᵀAy

where A is the n × n diagonal (only the diagonal is non-zero) matrix with entries A_ii = λ_i.
xy
2
 x   y 
   1    1  
 x2   y 2 
2
  x 12   y 12  

 

   2 x1 x 2    2y 1y 2  
 
 
 
2
2
x
y
2
2





 Φ(x)  Φ(y)
(6.6)
So the data is mapped to the ℝ³. But the second step can be left out by implicitly calculating Φ(x) · Φ(y) with the vectors in input space:

( (x₁, x₂) · (y₁, y₂) )² = (x₁y₁ + x₂y₂)² = x₁²y₁² + 2x₁x₂y₁y₂ + x₂²y₂²

which is the same as in the above calculation, first mapping the input vectors to the feature space and then calculating the dot product:

Φ(x) · Φ(y) = (x₁², √2 x₁x₂, x₂²) · (y₁², √2 y₁y₂, y₂²) = x₁²y₁² + 2x₁x₂y₁y₂ + x₂²y₂² .

So by implicitly mapping the input vectors to the feature space, we are able to calculate the dot product there without even knowing the underlying mapping Φ!
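A tiny numerical check of (6.6), with made-up input vectors, illustrates that both routes give the same number:

import numpy as np

# explicit map Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) for the degree-2 kernel
phi = lambda v: np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ y) ** 2)      # kernel evaluated in input space: (x.y)^2 = 1.0
print(phi(x) @ phi(y))   # the same value via the explicit feature map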
Summarized, it can be stated that such a non-linear mapping to a higher dimensional space can be performed implicitly, without increasing the number of parameters, because the kernel function computes the inner product in feature space only by use of the two inputs in input space. To generalize: a polynomial kernel K(x, y) = (x · y)^d with d ≥ 2 and attributes in input space of dimension n maps the data to a feature space of dimension C(n + d − 1, d).¹⁴ In the example of (6.6) this means with n = 2 and d = 2:

¹⁴ C(n, k) = n! / (k! (n − k)!), called the binomial coefficient
C(n + d − 1, d) = C(2 + 2 − 1, 2) = C(3, 2) = 3! / (1! · 2!) = (3 · 2 · 1) / (1 · 2 · 1) = 3

And as can be seen above, the data is really mapped from the ℝ² to the ℝ³.
In figure 6.5 the whole “new” procedure for the classification of an unknown point z is shown, after training of the kernel-based SVM and therefore having the optimal weight vector w (defined by the α_i's, the corresponding training points x_i and their labels y_i) and the bias b.

Figure 6.5: The whole procedure for classification of a test vector z (in this example the test and training vectors are simple digits).
To stress the important facts: in contrast to the example in chapter 6.1, the chain of arguments is now inverted. There we started by explicitly defining a mapping Φ before applying the learning algorithm. Now the starting point is choosing a kernel function K, which implicitly defines the mapping Φ and therefore avoids the feature space both in the computation of inner products and in the whole design of the learning machine itself. As seen above, both the learning and the test step only depend on the value of inner products in feature space. Hence, as shown, they can be formulated in terms of kernel functions. So once such a kernel function has been chosen, the decision function for unseen data z, (5.23), becomes:
71
l
f ( z )  sgn(  w i  i ( z )  b )
i 1
l
 sgn(   i y i K ( x i , z )  b )
(6.7)
i 1
And as said before, as a consequence we do not need to know the underlying feature map to be able to solve the learning task in feature space!

Remark: As remarked in chapter 5, a consequence of using kernels is that directly storing the resulting weight vector w is no longer practicable, because as seen in (6.7) above we would then have to know the mapping and could not use the advantage arising from the usage of kernels.
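Correspondingly, a kernelized decision function only needs the stored support vectors, never w itself. A minimal sketch (all names are illustrative; sv_x, sv_y, sv_alpha and b are thought of as coming out of training):

import numpy as np

def kernel_decision(z, sv_x, sv_y, sv_alpha, b, K):
    # evaluate (6.7) using only kernel calls against the support vectors
    s = sum(a * y * K(x, z) for a, y, x in zip(sv_alpha, sv_y, sv_x))
    return np.sign(s + b)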
But which functions can be chosen as kernels?
6.2.1 Requirements for Kernels - Mercer's Condition -
As a first requirement for a function to be chosen as a kernel, definition 6.1 gives two conditions, because the mapping has to be into an inner product feature space. So it can easily be seen that K has to be a symmetric function:

K(x, z) = Φ(x) · Φ(z) = Φ(z) · Φ(x) = K(z, x)    (6.8)

And another condition for an inner product space is the Schwarz Inequality:

K(x, z)² = (Φ(x) · Φ(z))² ≤ ||Φ(x)||² ||Φ(z)||² = (Φ(x) · Φ(x)) (Φ(z) · Φ(z)) = K(x, x) K(z, z)    (6.9)
However, these conditions are not sufficient to guarantee the existence of a feature space. Here Mercer's Theorem gives sufficient conditions (Vapnik 1995; Courant and Hilbert 1953). The following formulation of Mercer's Theorem is given without proof, as it is stated in the paper [Bur98].
Theorem 6.2 (Mercer's Theorem)
There exist a mapping Φ and an expansion K(x, y) = Φ(x) · Φ(y) = Σ_i φ_i(x) φ_i(y) if and only if, for any g(x) such that

∫ g(x)² dx < ∞ (is finite)    (6.10)

it holds that

∫∫ K(x, y) g(x) g(y) dx dy ≥ 0    (6.11)

Note: (6.11) has to hold for every g satisfying (6.10). This theorem is also sufficient for the infinite case.
Another, simplified condition for K to be a kernel in the finite case can be seen from (6.8), (6.9) and by describing K with its eigenvectors and eigenvalues (the proof is given in [Nel00]).

Proposition 6.3  Let X be a finite input space with K(x, z) a symmetric function on X. Then K(x, z) is a kernel function if and only if the matrix K is positive semi-definite.

Mercer's Theorem is therefore an extension of this proposition, based on the study of integral operator theory.
6.2.2 Making Kernels from Kernels
Theorem 6.2 is the basic tool for verifying that a function is a kernel. The remarked proposition 6.3 gives the requirement for a finite set of points. Now this criterion for a finite set is applied to confirm that a number of new kernels can be created. The next proposition of Cristianini and Shawe-Taylor [Nel00] allows creating more complicated kernels from simple building blocks:
Proposition 6.4  Let K₁ and K₂ be kernels over X × X, X ⊆ ℝⁿ, a ∈ ℝ⁺, f(·) a real-valued function on X, Φ : X → ℝᵐ with K₃ a kernel over ℝᵐ × ℝᵐ, p(·) a polynomial with positive coefficients, and B a symmetric positive semi-definite n × n matrix. Then the following functions are kernels, too:

- K(x, z) = K₁(x, z) + K₂(x, z)
- K(x, z) = a K₁(x, z)
- K(x, z) = K₁(x, z) K₂(x, z)
- K(x, z) = f(x) f(z)
- K(x, z) = K₃(Φ(x), Φ(z))
- K(x, z) = xᵀBz
- K(x, z) = p(K₁(x, z))
- K(x, z) = exp(K₁(x, z))    (6.12)
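As a small sketch of these closure properties (kernels written as plain functions on vector pairs; all names are illustrative):

import numpy as np

k1 = lambda x, z: x @ z                    # linear kernel
k2 = lambda x, z: (x @ z + 1.0) ** 2       # polynomial kernel

k_sum  = lambda x, z: k1(x, z) + k2(x, z)  # K1 + K2 is a kernel
k_scal = lambda x, z: 0.5 * k1(x, z)       # a * K1 with a > 0 is a kernel
k_prod = lambda x, z: k1(x, z) * k2(x, z)  # K1 * K2 is a kernel
k_exp  = lambda x, z: np.exp(k1(x, z))     # exp(K1) is a kernel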
6.2.3 Some well-known Kernels
The selection of a kernel function is an important problem in applications, although there is no theory that tells which kernel to use when. Moreover, it can be very difficult to check that some particular kernel satisfies Mercer's conditions, since they must hold for every g satisfying (6.10). In the following, some well known and widely used kernels are presented. The selection of a kernel, perhaps from among the presented ones, is usually based on experience and knowledge about the classification problem at hand, and also on theoretical considerations. The problem of choosing a kernel and its parameters on the basis of theoretical considerations will be discussed in chapter 7. Each kernel will be explained below.
 xz
 c
p
Polynomial
K(x, z) =
(6.13)
Sigmoid
K(x, z) = tanh(  x  z   )
(6.14)
Radial Basis Function
- Gaussian Kernel -
6.2.3.1
K(x, z) = exp( 
xz
2
)
2 2
(6.15)
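Written out as code (a sketch; the parameter names p, c, kappa/κ, delta/δ and sigma/σ follow the formulas above):

import numpy as np

def poly_kernel(x, z, p=2, c=1.0):
    return (x @ z + c) ** p                                 # (6.13)

def sigmoid_kernel(x, z, kappa=1.0, delta=1.0):
    return np.tanh(kappa * (x @ z) - delta)                 # (6.14)

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))   # (6.15)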
6.2.3.1 Polynomial Kernel
Here p gives the degree of the polynomial and c is some non-negative constant, usually c = 1. The usage of a generalized inner product instead of the standard inner product above was proposed in many other works on SVMs, to keep the Hessian matrix from becoming zero in numerical calculations (which would mean no solution for the optimization problem). Then the kernel becomes:

K(x, z) = ( Σ_i σ_i x_i z_i + c )^p

where the vector σ is chosen such that the function satisfies Mercer's condition.
Figure 6.6: A polynomial kernel of degree 2 used for the classification of the non-separable XOR data set (in input space, by a linear classifier). Each colour represents one class and the dashed lines mark the margins. The level of shading indicates the functional margin; in other words: the darker the shading of one colour representing a specific class, the more confident the classifier is that a point in that region belongs to that class.
6.2.3.2 Sigmoid-function
The sigmoid kernel stated above usually satisfies Mercer's condition only for certain values of the parameters κ and δ. This was noticed experimentally by Vapnik; currently there are no theoretical results on the parameter values that satisfy Mercer's conditions. As stated in [Pan01], the usage of the sigmoid kernel with the SVM can be regarded as a two-layer neural network. In such networks the input vector z is mapped by the first layer into the vector F = (F₁ … F_N), where F_i = tanh(κ (x_i · z) − δ), i = 1 … N, and the dimension N of F is called the number of Hidden Units. In the second layer the sign of the weighted sum of the elements of F is calculated, using weights a_i. Figure 6.7 illustrates that. The main difference to notice between SVMs and two-layer neural networks is the different optimization criterion: in the SVM case the goal is to find the optimal separating hyperplane which maximizes the margin (in the feature space), while in a two-layer neural network the criterion is usually to minimize the empirical risk associated with some loss function, typically the mean squared error.
Figure 6.7: A 2-layer neural network with N hidden units. The outputs of the first layer are of the form F_i = tanh(κ (x_i · z) − δ), i = 1 … N, while the output of the whole network then becomes ŷ = sgn( Σ_{i=1}^{N} α_i y_i F_i + b ) ∈ {+1, −1}.
Another important note: in neural networks the optimal network architecture is quite often unknown and mostly found only by experiments and/or prior knowledge, while in the SVM case such problems are avoided. Here the number of hidden units is the same as the number of support vectors, and the weights in the output layer (a_i = α_i y_i) are all determined automatically in the linearly separable case (in feature space).
6.2.3.3 Radial Basis Function (Gaussian)
The Gaussian kernel is also known as the Radial Basis Function. In the above function (6.15), σ (the variance) defines a so called window width (the width of the Gaussian). It is of course possible to have different window widths for different vectors, meaning to use a vector σ (see [Cha00]). As some works show [Lin03], the RBF kernel is a good starting point for a first try if one knows nearly nothing about the data to classify. The main reasons will be stated in the upcoming chapter 7, where the parameter selection will also be discussed.
Figure 6.8: A SVM with a Gaussian kernel, a value of sigma σ = 0.1 and the application of the maximal margin case (C = ∞) on an artificially generated training set.
Another remark: up to now the algorithm, and so the above introduced classifiers, are only intended for the binary case. But as we will see in chapter 8, this can easily be extended to the multiclass case.
6.3 Summary
Kernels are a very powerful tool when dealing with nonlinearly separable datasets. The usage of the Kernel Trick has long been known and it has therefore been studied in detail. By its usage the problem to solve still stays the same as in the previous chapters, but the dot product in the formulas is rewritten using the implicit kernel mapping.
Maximize
l
LD ( w, b, ξ, α,β)  W (α)    i 
i 1
1 l
  i  j y i y j K (x i ; x j )
2 i , j 1
l
subject to
 y
i 1
i
i
(6.16)
0
0  i  C
i = 1…l
And with the same KT conditions as in the summary under 5.5.2. The overall decision function for some unseen data z then becomes:

f(z, α*, b*) = sgn( w* · Φ(z) + b* )
            = sgn( Σ_{i=1}^{l} y_i α_i* K(x_i, z) + b* )    (6.17)
            = sgn( Σ_{i ∈ SV} y_i α_i* K(x_i, z) + b* )
Note: This kernel representation will be used from now on. To give the link to the linear case of chapter 5, where K(x_i, x_j) is “replaced” by x_i · x_j, this “kernel” will be called the Linear Kernel.
Chapter 7
Model Selection
As introduced in the last chapter, without building an own kernel based on knowledge about the problem at hand, it is intuitive to first try the common and well known kernels. This approach is mainly used, as the examples in appendix A will show. But as a first step there is the choice of which kernel to use for the beginning. Afterwards the penalty parameter C and the kernel parameters have to be chosen, too.
7.1 The RBF Kernel
As suggested in [Lin03], the RBF kernel is in general a reasonable first choice. However, if the problem at hand is nearly the same as some already successfully solved ones (hand digit recognition, face recognition, …), which are documented in detail, a first try should be given to the kernels used there; the parameters mostly have to be chosen in other ranges applicable to the actual problem, though. Some examples of such already solved problems and links to further readings about them will be given in appendix A.
As shown in the last chapter, the RBF kernel, like others, maps samples into a higher dimensional space, so in contrast to the linear kernel it is able to handle the case where the relation between class labels and attributes is nonlinear. Furthermore, the linear kernel is a special case of the RBF one, as [Kel03] shows that the linear kernel with a penalty parameter C has the same performance as the RBF kernel with some parameters (C, γ)¹⁵. In addition, the sigmoid kernel behaves like RBF for certain parameters [Lil03]. Another reason is the number of hyperparameters, which influence the complexity of model selection. The polynomial kernel has more of them than the RBF kernel.
¹⁵ γ = 1 / (2σ²)
Finally, the RBF kernel has fewer numerical difficulties. One key point is 0 < K_ij ≤ 1, in contrast to polynomial kernels, whose kernel values may go towards infinity. Moreover, as said in the last chapter, the sigmoid kernel is not valid (i.e. not the inner product of two vectors) under some parameters.
7.2 Cross-Validation
In the case of RBF kernels there are two tuning parameters: C and γ. It is not known beforehand which values are best for the problem at hand, so some ‘parameter search’ must be done to identify the optimal ones. Optimal means finding C and γ such that the classifier can accurately predict unknown data after training, i.e. testing data. It is not useful to achieve high training accuracy at the cost of generalization ability. Therefore a common way is to separate the training data into two parts, of which one is considered unknown when training the classifier. Then the prediction accuracy on this set can more precisely reflect the performance on classifying unknown data. An improved version of this technique is known as cross-validation.

In the so called k-fold cross-validation, the training set is divided into k subsets of equal size. Sequentially, one subset is tested using the classifier trained on the remaining k − 1 subsets. Thus each instance of the whole training set is predicted once, and the cross-validation accuracy is the percentage of data which are correctly classified. The main disadvantage of this procedure is its computational intensity, because the model has to be trained k times. A special case of this scheme is k-fold cross-validation with k = l, the training set size: sequentially remove the i-th training example and train with the remaining ones. This procedure is known as Leave-One-Out (loo).
Another technique is known as Grid-Search. This approach has been chosen by [Lin03]. The main idea is to simply try pairs of (C, γ) and pick the one with the best cross-validation accuracy. Mentioned in this paper is the observation that trying exponentially growing sequences of C and γ is a practical way to find good parameters, e.g. C = 2⁻⁵, 2⁻³, …, 2¹⁵ and γ = 2⁻¹⁵, 2⁻¹³, …, 2³. This search method is straightforward and ‘stupid’ in some way. But as said in the paper above, while there are advanced techniques for grid-searching, they perform an exhaustive parameter search by approximation or heuristics. Another reason is that it has been shown that the computational time to find good parameters by the original grid-search is not much more than that by advanced methods, since there are still the same two parameters to be optimized.
Chapter 8
Multiclass Classification
Up to now the study has been limited to the two-class case, called the binary case, where only two classes of data have to be separated. However, in real-world problems there are in general more than two classes to deal with. The training set still consists of pairs (x_i, y_i) where x_i ∈ ℝⁿ, but now y_i ∈ {1, …, n}, with n the number of classes and i = 1 … l. The first straightforward idea is to reduce the multiclass problem to many two-class problems, so that each resulting class is separated from the remaining ones.
8.1 One-Versus-Rest (OVR)
So, as mentioned above, the first idea for a procedure to construct a multiclass classifier is the construction of n two-class classifiers with the following decision functions:

f_k(z) = sgn(w_k · z + b_k) ;  k = 1 … n    (8.1)

This means that the classifier for class k separates this class from all other classes:

f_k(x) = +1 if x belongs to class k, −1 otherwise

So the step-by-step procedure starts with class one: construct the first binary classifier for class 1 (positive) versus all others (negative), then class 2 versus all others, …, class k (= n) versus all others.
The resulting combined OVR decision function chooses the class for a sample that corresponds to the maximum value of the n binary decision functions (i.e. the furthest “positive” hyperplane). For clarification see figure 8.1 and table 8.1. This whole first approach to gain a multiclass classifier is computationally very expensive, because there is a need for solving n quadratic programming (QP) optimization problems of size l (the training set size). As an example consider the three-class problem with linear kernel introduced in figure 8.1. The OVR method yields a decision surface divided by three separating hyperplanes (the dashed lines). The shaded regions in the figure correspond to tie situations, where two or none of the classifiers are active, i.e. vote positively at the same time (also see table 8.1).
Figure 8.1: OVR applied to a three-class (A, B, C) example with linear kernel
Now consider the classification of a new unseen sample (hexagonal in
figure 8.1) in the ambiguous region 3. This sample receives positive votes
from both the A-class and C-class binary classifiers. However the distance
of the sample from the “A-class-vs.-all” hyperplane is larger than from the
“C-class-vs.-all” one. Hence, the sample is classified to belong to the A
class.
In the same way the ambiguous region 7 with no votes is handled.
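The combined decision rule just described is easy to state in code. A sketch (decision_values is a hypothetical helper returning the n values f_k(z) = w_k · z + b_k):

import numpy as np

def ovr_predict(z, decision_values):
    f = np.asarray(decision_values(z))  # one signed distance per class
    return int(np.argmax(f))            # furthest "positive" hyperplane wins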
So the final combined OVR decision function results in the decision surface separated by the solid line in figure 8.1. Notice, however, that this final decision function significantly differs from the original one, which corresponded to the solution of n (here 3) QP optimization problems. The major drawback here is therefore that only three points (black balls in figure 8.1) of the resulting borderlines coincide with the original ones calculated by the n Support Vector Machines. So it seems that the benefits of maximal margin hyperplanes are lost. Summarized, it can be said that this is the simplest multiclass SVM method [Krs99 and Stat].
Region | A vs. B and C | B vs. A and C | C vs. A and B | Resulting class
1      | -             | B             | C             | ?
2      | -             | -             | C             | C
3      | A             | -             | C             | ?
4      | A             | -             | -             | A
5      | A             | B             | -             | ?
6      | -             | B             | -             | B
7      | -             | -             | -             | ?

Table 8.1: Three binary OVR classifiers applied to the corresponding example (figure 8.1). The column “Resulting class” contains the resulting classification of each region. Cells with “?” correspond to tie situations when two or none of the classifiers are active at the same time. See text for how ties are resolved.
8.2 One-Versus-One (OVO)
The idea behind this approach is to construct a decision function f_km : ℝⁿ → {−1, +1} for each pair of classes (k, m); k, m = 1 … n:

f_km(x) = +1 if x belongs to class k, −1 if x belongs to class m

So in total there are C(n, 2) = n(n − 1)/2 pairs, because this technique involves the construction of the standard binary classifier for all pairs of classes. In other words, for every pair of classes a binary SVM is solved, with the underlying optimization problem to maximize the margin. The decision function therefore assigns an instance to the class which has the largest number of votes after the sample has been tested against all decision functions. So the classification now involves n(n − 1)/2 comparisons, and in each one the class to which the sample belongs in that binary case gets a “1” added to its number of votes (“Max Wins” strategy). Of course there can still be tie situations; in such a case the sample will be assigned based on the classification provided by the furthest hyperplane, as in the OVR case [Krs99 and Stat]. As some researchers have proposed, this can be simplified by choosing the class with the lowest index when a tie occurs, because even then the results are mostly still accurate and a good enough approximation [Lin03], without additional computation of distances. But this has to be verified for the problem at hand.
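A sketch of the “Max Wins” voting (binary_predict is a hypothetical helper returning +1 if the (k vs. m) machine assigns z to class k, else −1; argmax returns the first maximum, so ties fall to the lowest class index, matching the simplification above):

import numpy as np
from itertools import combinations

def ovo_predict(z, n_classes, binary_predict):
    votes = np.zeros(n_classes, dtype=int)
    for k, m in combinations(range(n_classes), 2):  # all n(n-1)/2 pairs
        votes[k if binary_predict(k, m, z) > 0 else m] += 1
    return int(np.argmax(votes))    # "Max Wins"; ties -> lowest class index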
The main benefit of this approach is that for every pair of classes the optimization problem to deal with is much smaller: in total one only needs to solve n(n − 1)/2 QP problems of size smaller than l (the training set size), because only two classes are involved in each problem, and not the whole training set as in the OVR approach.
Again consider the three-class example from the previous chapter. Using the OVO technique with a linear kernel, the decision surface is divided by three separate hyperplanes (dashed lines) obtained by the binary SVMs (see figure 8.2). The application of the “Max Wins” strategy (see table 8.2) results in the division of the decision surface into three regions (separated by the thicker dashed lines) and the small shaded ambiguous region in the middle. After the tie-breaking strategy from above (furthest hyperplane) is applied to the ambiguous region 7 in the middle, the final decision function becomes the solid black lines and the thicker dashed ones together. Notice here that the final decision function does not differ significantly from the original one corresponding to the solution of the n(n − 1)/2 optimization problems. So the main advantage here, in contrast to the OVR technique, is the fact that the final borderlines are parts of the calculated pairwise decision functions, which was not the case in the OVR approach.
Figure 8.2: OVO applied to the three class example (A, B, C) with linear kernel
Decision of the classifier

Region   A vs. C   B vs. C   A vs. B   Resulting class
1        C         C         B         C
2        C         C         A         C
3        A         C         A         A
4        A         B         A         A
5        A         B         B         B
6        C         B         B         B
7        C         B         A         ?
Table 8.2: Three binary OVO classifiers applied to the corresponding example (figure 8.2). The column "Resulting class" contains the resulting classification of each region according to the "Max Wins" strategy. The only cell with "?" corresponds to the tie situation where each class receives exactly one vote. See text for how this tie is resolved.
8.3 Other Methods
The above methods are only some of the ones usable for multiclass SVMs, but they are the most intuitive ones. Other methods are e.g. the usage of binary decision trees, which are nearly the same as the OVO method. For details on them see [Pcs00].
Another method was proposed by Weston and Watkins ("WW") [Stat and WeW98]. In this technique the n-class case is reduced to solving a single quadratic optimization problem of the new size (n-1)*l, which is identical to the binary SVM for the case n = 2. There exist some speed-up techniques for this optimization problem, called decomposition [Stat], but the main disadvantage is that the optimality of this method is not yet proven.
An extension to this was given by Crammer and Singer ("CS"). There the same problem as in the WW approach has to be solved, but they managed to reduce the number of slack variables in the constraints of the optimization problem, and hence it is computationally cheaper. There also exist decomposition techniques for speeding it up [Stat]. But unfortunately, the same as above, the optimality has not been demonstrated yet.
But which method is now suitable for a given problem?
As shown in [WeW98] and other papers, the optimal technique is mostly the WW approach. This method has shown the best results in comparison to OVO, OVR and the binary decision trees. But as this method is not yet proven to be optimal and requires some reformulation of the problem, it is not easy to implement.
As a good compromise the OVO method can be chosen. This method is used by most of the actual implementations and has been shown to produce good results [Lin03].
Vapnik himself has used the OVR method, which is mainly attributed to its smaller computational effort: in the OVR case only n hyperplanes have to be constructed, one for each class, while in the OVO case there are n(n-1)/2 of them to compute. So the use of the OVR technique decreases the computational effort by a factor of (n-1)/2.
The main advantage compared with the WW method is that in OVR (as in OVO) one is able to choose different kernels for each separation, which is not possible in the WW case, because it is a joint computation [Vap98].
Part III
Implementation
Chapter 9
Implementation Techniques
In the previous chapters it was shown that the training of Support Vector Machines can be reduced to maximizing a convex quadratic function subject to linear constraints (see chapter 5.5.1). Such convex quadratic functions have only one local maximum (the global one), and their solution can always be found efficiently. Furthermore, the dual representation of the problem showed how the training can be successfully performed even in very high dimensional feature spaces.
The problem of minimizing differentiable functions of many variables has been widely studied, especially in the convex case, and most of the standard techniques can be directly applied to SVM training. However, there exist specific techniques that exploit particular features of this problem. For example, the large size of the training set is a formidable obstacle to a direct use of standard techniques, since just storing the kernel matrix requires a memory space that grows quadratically with the sample size.
9.1 General Techniques
A number of optimization techniques have been devised over the years, and many of them can be directly applied to quadratic programs. Examples are the Newton method, conjugate gradient, or primal-dual interior-point methods. They can be applied to the case of Support Vector Machines straightforwardly. Moreover, they can be considerably simplified because of the specific structure of the objective function.
Conceptually they are not very different from the simple gradient ascent strategy known from Neural Networks (for an adaptation to SVMs, see [Nel00]). But many of these techniques require the kernel matrix to be stored completely in memory. The quadratic form in (5.18) involves a matrix with a number of elements equal to the square of the number of training examples. Such a matrix cannot fit into a memory of size 128 Megabytes if there are more than 4000 training examples, assuming each element is stored as an 8-byte double precision number (4000² x 8 bytes = 128 MB). So for large problems the approaches described above can be inefficient or even impossible. They are therefore used in conjunction with so-called decomposition techniques ("Chunking and Decomposition", for an explanation see [Nel00]). The main idea behind these methods is to subsequently optimize only a small subset of the problem in each iteration.
The main advantage of such techniques is that they are well understood and widely available in a number of commercial and freeware packages. These were mainly used for Support Vector Machines before special algorithms were developed. The most common packages were, for example, the MINOS package from the Stanford Optimization Laboratory (hybrid strategy) and the LOQO package (primal-dual interior-point method). In contrast to these, the quadratic program subroutine qp provided in the MATLAB optimization toolbox is very general, but the routine quadprog is significantly better than qp.
9.2 Sequential Minimal Optimization (SMO)
The algorithm used, in a slightly changed manner, in nearly every implementation of SVMs, and also in the one of this diploma thesis, is the SMO algorithm. It was developed by John C. Platt [Pla00], and its main advantage, besides being one of the most competitive, is the fact that it is simple to implement.
The idea behind this algorithm is derived by taking the idea of the decomposition method to its extreme and optimizing a minimal subset of just two points at each iteration. The power of this approach resides in the fact that the optimization problem for two data points admits an analytical solution, eliminating the need to use an iterative quadratic program optimizer as part of the algorithm. So SMO breaks the large QP problem into a series of smallest possible QP problems and solves them analytically, which avoids using a time-consuming numerical QP optimization as an inner loop. The amount of memory required for SMO is therefore linear in the training set size, no longer quadratic, which allows SMO to handle very large training sets. The computation time of SMO is mainly dominated by the SVM evaluation, as will be seen below.
The smallest possible subset for optimization involves two Lagrange multipliers, because the multipliers must obey the linear equality constraint (of 5.20) $\sum_{i=1}^{l} \alpha_i y_i = 0$; therefore, when updating one multiplier $\alpha_k$, at least one other multiplier $\alpha_p$ ($k \neq p$ and $1 \le k, p \le l$) has to be adjusted in order to keep the condition true.
At every step, SMO chooses two Lagrange multipliers to jointly optimize, finds the optimal values for them, and updates the SVM to reflect the new optimal values.
So the advantage of SMO, to repeat it again, lies in the fact that solving for two Lagrange multipliers can be done analytically. Thus, an entire inner iteration of numerical QP optimization is avoided. Even though more optimization sub-problems are solved now, each sub-problem is solved so fast that the overall QP problem can be solved quickly (a comparison between the most commonly used methods can be found in [Pla00]). In addition, SMO does not require extra matrix storage (ignoring the minor amounts of memory required to store any 2x2 matrices required by SMO). Thus, very large SVM training problems can fit even inside the memory of an ordinary personal computer.
The SMO algorithm mainly consists of three components:

- An analytic method to solve for the two Lagrange multipliers
- A heuristic for choosing which multipliers to optimize
- A method for computing the bias b
As already mentioned in chapter 5.2.1, the computation of the bias b can be problematic when simply taking the average value for b after summing up all calculated b's for each i. This was shown by [Ker01]. The main problem arising when using an averaged value of the bias for recalculation in the SMO algorithm is that its convergence speed is not guaranteed: sometimes it is slower and sometimes it is faster. So Keerthi suggested an improvement of the SMO algorithm where two threshold values bup and blow are used instead of one. It was shown in this paper that the modified SMO algorithm is more efficient on every tested dataset in contrast to the original one. The speed-up is significant!
But as a first introduction the original SMO algorithm will be used here; it can be extended later. Before continuing, one disadvantage of the SMO algorithm should be stated: in the original form implemented in nearly every toolbox it cannot handle the 2-norm case, because the KT conditions are different, as can be seen in chapter 5.5.2. Therefore nearly every toolbox which wants to implement the 2-norm case uses the optimization techniques mentioned above. Only one implements the 1- and 2-norm case at the same time with an extended form of the SMO algorithm (LibSVM by Chih-Jen Lin). The 2-norm case will also be added to the SMO algorithm developed in this diploma thesis.
As will be seen, SMO spends most of the time evaluating the decision function rather than performing QP, so it can exploit data sets which contain a substantial number of zero elements. Such sets will be called sparse.
9.2.1 Solving for two Lagrange Multipliers
First recall the generally formulated mathematical problem:

Maximize

$L_D(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = W(\boldsymbol{\alpha}) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$

subject to

$\sum_{i=1}^{l} \alpha_i y_i = 0$
$0 \le \alpha_i \le C, \quad i = 1 \ldots l$

with the following KT conditions fulfilled, if the QP problem is solved, for all i (for maximal margin and 1-norm):

$\alpha_i = 0 \Rightarrow y_i(\mathbf{w}^* \cdot \mathbf{x}_i + b^*) \ge 1$
$0 < \alpha_i < C \Rightarrow y_i(\mathbf{w}^* \cdot \mathbf{x}_i + b^*) = 1$
$\alpha_i = C \Rightarrow y_i(\mathbf{w}^* \cdot \mathbf{x}_i + b^*) \le 1$
For convenience, all quantities referring to the first multiplier will have a subscript 1 and those referring to the second a subscript 2. Without the additional superscript "old" they denote the just optimized "new" values. For initialization, $\boldsymbol{\alpha}^{old}$ is set to zero.
In order to take a step towards the overall solution, two $\alpha_i$'s are picked, and SMO calculates the constraints on these two multipliers and then solves for the constrained maximum. Because there are only two constraints now, they can easily be displayed in two dimensions (see figure 9.1). The constraints $0 \le \alpha_i \le C$ cause the Lagrange multipliers to lie inside a box, while the linear equality constraint $\sum_{i=1}^{l} \alpha_i y_i = 0$ causes them to lie on a diagonal line. Thus, the constrained maximum of the objective function $W(\boldsymbol{\alpha})$ must lie on a diagonal line segment (explanation in figure 9.1 and on the following pages).
In other words, in order not to violate the linear constraint, the two multipliers must fulfil $\alpha_1 y_1 + \alpha_2 y_2 = \text{const.} = \alpha_1^{old} y_1 + \alpha_2^{old} y_2$ (lie on a line) inside the box constrained by $0 \le \alpha_{1,2} \le C$.
So this one-dimensional problem, resulting from the restriction of the objective function to such a line, can be solved analytically.
Figure 9.1: Two cases of optimization: $y_1 \neq y_2$ and $y_1 = y_2$. The two Lagrange multipliers chosen for subset optimization must fulfil all of the constraints of the full problem. The inequality constraints cause them to lie inside a box and the linear equality constraint causes them to lie on a diagonal line. Therefore, one step of SMO must find an optimum of the objective function on a diagonal line segment. In this figure, $\gamma = \alpha_1^{old} + m\alpha_2^{old}$, which is a constant that depends on the previous values of $\alpha_1$, $\alpha_2$ and $m = y_1 y_2 \in \{-1, +1\}$. ($\gamma = \alpha_1 + m\alpha_2 = \alpha_1^{old} + m\alpha_2^{old}$)
Without loss of generality, the algorithm first computes the second multiplier $\alpha_2$ and expresses the ends of the diagonal line segment in terms of it; $\alpha_2$ is then used to obtain $\alpha_1$. The bounds on the new multiplier $\alpha_2$ can be formulated more restrictively using the linear constraint and the box constraint (also see figure 9.2). But first recall that for each $\alpha_i$: $0 \le \alpha_i \le C$, and that the linear constraint has to hold: $\sum_{i=1}^{l} \alpha_i y_i = 0$. Using the two actual multipliers to be optimized we write

$\alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i=3}^{l} \alpha_i y_i$

and therefore, multiplying by $y_1$ and setting $\gamma = -y_1 \sum_{i=3}^{l} \alpha_i y_i$, the constraint becomes $\alpha_1 + y_1 y_2 \alpha_2 = \gamma$.

There are two cases to consider (remember $y_i \in \{-1, +1\}$):
Figure 9.2: Case 1: $y_1 \neq y_2$. $\gamma$, $\gamma'$ and the two lines, indicating the cases where $\alpha_1 > \alpha_2$ or $\alpha_1 < \alpha_2$.

Case 1: $y_1 \neq y_2$, then $\alpha_1 - \alpha_2 = \gamma$   (9.1)

Case 2: $y_1 = y_2$, then $\alpha_1 + \alpha_2 = \gamma$   (9.2)

Now let $m = y_1 y_2$; then the two above equations can be written as

$\alpha_1 + m\alpha_2 = \gamma$   (9.3)

and before optimization $\gamma = \alpha_1^{old} + m\alpha_2^{old}$.

The end points of the searched diagonal line segment (figures 9.2 and 9.3) can then be expressed with the help of the old, possibly not yet optimal values:

Case 1: $y_1 \neq y_2$, $\alpha_1^{old} - \alpha_2^{old} = \gamma$:
L ($\alpha_2$ at the lower end point) is: $\max(0, -\gamma) = \max(0, \alpha_2^{old} - \alpha_1^{old})$
H ($\alpha_2$ at the higher end point) is: $\min(C, C - \gamma) = \min(C, C + \alpha_2^{old} - \alpha_1^{old})$
Figure 9.3: Case 2: $y_1 = y_2$. $\gamma$, $\gamma'$ and the two lines, indicating the cases where $\alpha_1 > \alpha_2$ or $\alpha_1 < \alpha_2$.

Case 2: $y_1 = y_2$, $\alpha_1^{old} + \alpha_2^{old} = \gamma$:
L ($\alpha_2$ at the lower end point) is: $\max(0, \gamma - C) = \max(0, \alpha_1^{old} + \alpha_2^{old} - C)$
H ($\alpha_2$ at the higher end point) is: $\min(C, \gamma) = \min(C, \alpha_1^{old} + \alpha_2^{old})$

As a summary, the bounds on $\alpha_2$ are

$L \le \alpha_2 \le H$   (9.4)

where, if $y_1 \neq y_2$:
$L = \max(0, \alpha_2^{old} - \alpha_1^{old})$
$H = \min(C, C + \alpha_2^{old} - \alpha_1^{old})$

and if $y_1 = y_2$:
$L = \max(0, \alpha_1^{old} + \alpha_2^{old} - C)$
$H = \min(C, \alpha_1^{old} + \alpha_2^{old})$
At first glance this only appears to be applicable to the 1-norm case, but treating C as infinite for the hard margin case reduces the constraints to the interval [L, H]:

$L \le \alpha_2 \le H$

where, if $y_1 \neq y_2$:
$L = \max(0, \alpha_2^{old} - \alpha_1^{old})$, only lower bounded,

and if $y_1 = y_2$:
$L = 0$
$H = \alpha_1^{old} + \alpha_2^{old}$

A small sketch of this bound computation follows.
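As an illustration, here is a minimal C++ sketch of the computation of the interval [L, H] according to (9.4), including the hard margin case; encoding C = inf by a negative value is an assumption made only for this sketch:

#include <algorithm>
#include <limits>

// Sketch only: feasible interval [L, H] for the new alpha2 as in (9.4).
// A negative C encodes C = inf (hard margin case).
void compute_bounds(double a1_old, double a2_old, int y1, int y2,
                    double C, double& L, double& H)
{
    const bool hard_margin = (C < 0.0);
    const double inf = std::numeric_limits<double>::infinity();
    if (y1 != y2) {            // case 1: alpha1 - alpha2 = gamma
        L = std::max(0.0, a2_old - a1_old);
        H = hard_margin ? inf : std::min(C, C + a2_old - a1_old);
    } else {                   // case 2: alpha1 + alpha2 = gamma
        L = hard_margin ? 0.0 : std::max(0.0, a1_old + a2_old - C);
        H = hard_margin ? (a1_old + a2_old) : std::min(C, a1_old + a2_old);
    }
}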
Now that the other $\alpha_i$'s are assumed fixed, the objective function $W(\alpha_1, \alpha_2) = L_D$ can be rewritten (as an abbreviation, $\mathbf{x}_i \cdot \mathbf{x}_j = \mathbf{x}_i^T \mathbf{x}_j$ is written here as $\mathbf{x}_i \mathbf{x}_j$):

$L_D = \alpha_1 + \alpha_2 + \text{const.} - \frac{1}{2}\Big(\alpha_1^2 y_1 y_1 \mathbf{x}_1\mathbf{x}_1 + \alpha_2^2 y_2 y_2 \mathbf{x}_2\mathbf{x}_2 + 2\alpha_1\alpha_2 y_1 y_2 \mathbf{x}_1\mathbf{x}_2 + 2\Big(\sum_{i=3}^{l}\alpha_i y_i \mathbf{x}_i\Big)(\alpha_1 y_1 \mathbf{x}_1 + \alpha_2 y_2 \mathbf{x}_2)\Big)$

"const." denotes the parts depending only on the multipliers not optimized in this step, so they are regarded as constant values that are simply added. Now, for simplification, assume the following substitutions:

$K_{11} = \mathbf{x}_1\mathbf{x}_1$, $K_{22} = \mathbf{x}_2\mathbf{x}_2$, $K_{12} = \mathbf{x}_1\mathbf{x}_2$ and $v_j = \sum_{i=3}^{l} \alpha_i y_i \mathbf{x}_i\mathbf{x}_j$

As in figure 9.1, let $m = y_1 y_2$; with the equality constraint we get $\alpha_1 y_1 + \alpha_2 y_2 = \text{const.}$, which, multiplied by $y_1$, leads to $\alpha_1 = \gamma - m\alpha_2$ (since $y_1 y_1 = 1$), where $\gamma = \alpha_1 + m\alpha_2 = \alpha_1^{old} + m\alpha_2^{old}$.
Resubstituting all these relations back into $L_D$, the formula becomes:

$L_D = \alpha_1 + \alpha_2 - \frac{1}{2}\big(\alpha_1^2 K_{11} + \alpha_2^2 K_{22} + 2m\alpha_1\alpha_2 K_{12} + 2\alpha_1 v_1 y_1 + 2\alpha_2 v_2 y_2\big) + \text{const.}$

where const. is $\sum_{i=3}^{l}\alpha_i - \frac{1}{2}\sum_{i,j=3}^{l}\alpha_i\alpha_j y_i y_j K_{ij}$.

Using $\gamma$, this becomes a function depending on $\alpha_2$ only:

$L_D = \gamma - m\alpha_2 + \alpha_2 - \frac{1}{2}\big(K_{11}(\gamma - m\alpha_2)^2 + K_{22}\alpha_2^2 + 2mK_{12}(\gamma - m\alpha_2)\alpha_2 + 2v_1 y_1(\gamma - m\alpha_2) + 2v_2 y_2 \alpha_2\big) + \text{const.}$

$= \gamma - m\alpha_2 + \alpha_2 - \frac{1}{2}K_{11}\gamma^2 + K_{11}\gamma m\alpha_2 - \frac{1}{2}K_{11}\alpha_2^2 - \frac{1}{2}K_{22}\alpha_2^2 - mK_{12}\gamma\alpha_2 + K_{12}\alpha_2^2 - v_1 y_1 \gamma + v_1 y_1 m\alpha_2 - v_2 y_2 \alpha_2 + \text{const.}$

$= W(\alpha_2)$
To find the maximum of this function, the first and second derivatives of W with respect to $\alpha_2$ are needed:

$\frac{\partial W}{\partial \alpha_2} = -m + 1 + mK_{11}\gamma - K_{11}\alpha_2 - K_{22}\alpha_2 - mK_{12}\gamma + 2K_{12}\alpha_2 + mv_1 y_1 - v_2 y_2$
$= -m + 1 + mK_{11}(\gamma - m\alpha_2) - K_{22}\alpha_2 + K_{12}\alpha_2 - mK_{12}(\gamma - m\alpha_2) + y_2(v_1 - v_2)$

where $my_1 = y_1 y_2 y_1 = y_1^2 y_2 = y_2$.

$\frac{\partial^2 W}{\partial \alpha_2^2} = 2K_{12} - K_{11} - K_{22} = -\eta, \quad \text{with } \eta = K_{11} + K_{22} - 2K_{12}$
The following new notation will simplify the statement. $f(\mathbf{x})$ is the current hypothesis function, determined by the values of the actual vector $\boldsymbol{\alpha}$ and the bias b at a particular stage of learning. The newly introduced quantity E is the difference between the function output (the classification by the machine trained so far) and the target classification (given by the supervisor in the training set) on the training points $\mathbf{x}_1$ or $\mathbf{x}_2$; it is the training error on the i-th example:

$E_i = f(\mathbf{x}_i) - y_i = u_i - y_i = \Big(\sum_{j=1}^{l} \alpha_j y_j K(\mathbf{x}_j, \mathbf{x}_i) - b\Big) - y_i, \quad i = 1, 2$   (9.5)

This value may be large even if a point is classified correctly. For example, if $y_1 = 1$ and the function output is $f(\mathbf{x}_1) = 5$, the classification is correct, but $E_1 = 4$.
Recall the substitution $v_j = \sum_{i=3}^{l} \alpha_i y_i \mathbf{x}_i \mathbf{x}_j$; from (9.5), $u_i$ can be written as:

$u_1 = \sum_{j=1}^{l} \alpha_j y_j K_{1j} - b = \sum_{j=3}^{l} \alpha_j y_j K_{1j} + \alpha_1^{old} y_1 K_{11} + \alpha_2^{old} y_2 K_{12} - b = v_1 - b + \alpha_1^{old} y_1 K_{11} + \alpha_2^{old} y_2 K_{12}$

and so

$v_1 = u_1 + b - \alpha_2^{old} y_2 K_{12} - \alpha_1^{old} y_1 K_{11} = u_1 + b - \alpha_2^{old} y_2 K_{12} - m y_2 \alpha_1^{old} K_{11}$   (9.6)

$u_2 = \sum_{j=1}^{l} \alpha_j y_j K_{2j} - b = \sum_{j=3}^{l} \alpha_j y_j K_{2j} + \alpha_1^{old} y_1 K_{21} + \alpha_2^{old} y_2 K_{22} - b = v_2 - b + \alpha_1^{old} y_1 K_{12} + \alpha_2^{old} y_2 K_{22}$

and so

$v_2 = u_2 + b - \alpha_1^{old} y_1 K_{12} - \alpha_2^{old} y_2 K_{22}$   (9.7)
At the maximal point the first derivative $\frac{\partial W}{\partial \alpha_2}$ is zero and the second one has to be negative. Hence

$\alpha_2(K_{11} + K_{22} - 2K_{12}) = m\gamma(K_{11} - K_{12}) + y_2(v_1 - v_2) + 1 - m$

and with equations 9.6 and 9.7 this becomes (remember $m^2 = 1$ and $y_i^2 = 1$):

$\eta\,\alpha_2 = m\gamma(K_{11} - K_{12}) + y_2(u_1 + b - u_2 - b) - \alpha_2^{old}K_{12} - m\alpha_1^{old}K_{11} + m\alpha_1^{old}K_{12} + \alpha_2^{old}K_{22} + y_2^2 - y_1 y_2$
$= m(\alpha_1^{old} + m\alpha_2^{old})(K_{11} - K_{12}) - \alpha_2^{old}K_{12} - m\alpha_1^{old}K_{11} + m\alpha_1^{old}K_{12} + \alpha_2^{old}K_{22} + y_2(u_1 - u_2 + y_2 - y_1)$
$= m\alpha_1^{old}K_{11} - m\alpha_1^{old}K_{12} + \alpha_2^{old}K_{11} - \alpha_2^{old}K_{12} - \alpha_2^{old}K_{12} - m\alpha_1^{old}K_{11} + m\alpha_1^{old}K_{12} + \alpha_2^{old}K_{22} + y_2(u_1 - y_1 - u_2 + y_2)$
$= \alpha_2^{old}(K_{11} + K_{22} - 2K_{12}) + y_2(E_1 - E_2)$
$= \eta\,\alpha_2^{old} + y_2(E_1 - E_2)$

So the new multiplier $\alpha_2$ can be expressed as:

$\alpha_2^{new} = \alpha_2^{old} + \frac{y_2(E_1 - E_2)}{\eta}$   (9.8)
This is the unconstrained maximum, so it has to be constrained to lie within the ends of the diagonal line segment, meaning $L \le \alpha_2^{new} \le H$ (see figure 9.1):

$\alpha_2^{new,clipped} = \begin{cases} H & \text{if } \alpha_2^{new} \ge H \\ \alpha_2^{new} & \text{if } L < \alpha_2^{new} < H \\ L & \text{if } \alpha_2^{new} \le L \end{cases}$   (9.9)

The value of $\alpha_1^{new}$ is obtained from the equation $\alpha_1^{new} + m\alpha_2^{new,clipped} = \gamma = \alpha_1^{old} + m\alpha_2^{old}$, and therefore

$\alpha_1^{new} = \alpha_1^{old} + m(\alpha_2^{old} - \alpha_2^{new,clipped})$   (9.10)
As stated above, the second derivative has to be negative to ensure a maximum. Under unusual circumstances, however, it will not be negative; a zero second derivative can occur if more than one training example has the same input vector $\mathbf{x}$. In any event, SMO will work even if the second derivative is not negative, in which case the objective function W should be evaluated at each end of the line segment. SMO then uses the Lagrange multipliers at the end point which yields the higher value of the objective function. These circumstances are regarded and "solved" in the next sub-chapter about choosing the Lagrange multipliers to be optimized. A sketch of the complete analytic step follows.
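Putting (9.8), (9.9) and (9.10) together, the analytic step for one pair of multipliers can be sketched in a few lines of C++. This is only an illustration under the stated conventions, not the thesis implementation; the degenerate case $\eta \le 0$ is signalled but not handled here:

#include <algorithm>

// Sketch only: analytic SMO step for the two chosen multipliers.
// K11, K22, K12 are the kernel values, E1, E2 the cached errors and
// [L, H] the feasible interval from (9.4). Returns false for eta <= 0,
// where the objective must instead be evaluated at the interval ends.
bool smo_step(double K11, double K22, double K12,
              int y1, int y2, double E1, double E2,
              double L, double H,
              double& a1, double& a2)
{
    const double eta = K11 + K22 - 2.0 * K12;
    if (eta <= 0.0)
        return false;                          // unusual degenerate case
    const double a1_old = a1, a2_old = a2;
    double a2_new = a2_old + y2 * (E1 - E2) / eta;   // (9.8)
    a2_new = std::min(H, std::max(L, a2_new));       // clipping (9.9)
    const double m = static_cast<double>(y1 * y2);
    a1 = a1_old + m * (a2_old - a2_new);             // (9.10)
    a2 = a2_new;
    return true;
}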
9.2.2 Heuristics for choosing which Lagrange Multipliers to optimize
The SMO algorithm is based on the evaluation of the KT conditions: when every multiplier fulfils these conditions of the problem, the solution is found. These KT conditions are normally only verified up to a certain tolerance level $\varepsilon$. As Platt mentions in his paper, the value of $\varepsilon$ is typically in the range of $10^{-2}$ to $10^{-3}$, implying that e.g. outputs on the positive (+1) margin are between 0.999 and 1.001. Normally this tolerance is enough when using an SVM for recognition; demanding a higher accuracy makes the algorithm converge much more slowly.
There are two heuristics for choosing the two multipliers to optimize. The first choice heuristic, for $\alpha_1^{old}$, provides the outer loop of the SMO algorithm. This loop first iterates over the entire training set, determining whether an example violates the KT conditions. If so, this example is immediately chosen for optimization. The second example, and therefore the candidate for $\alpha_2^{old}$, is found by the second choice heuristic, and then these two multipliers are jointly optimized. At the end of this optimization the SVM is updated, and the algorithm resumes iterating over the training examples looking for KT violators. To speed up the training, the outer loop does not always iterate over the entire training set. After one pass through the training set, the outer loop iterates only over those examples whose Lagrange multipliers are neither 0 nor C (the non-bound examples). Again, each example is checked against the KT conditions, and violating ones are chosen for immediate optimization and update. The outer loop makes repeated passes over the non-bound examples until all of them obey the KT conditions within the tolerance level $\varepsilon$, and then iterates over the whole training set again to find violators. So, all in all, the outer loop keeps alternating between single passes over the whole training set and multiple passes over the non-bound subset until the entire set obeys the KT conditions within the tolerance level $\varepsilon$. At this point the algorithm terminates.
Once the first Lagrange multiplier to be optimized is chosen, the second one has to be found. The heuristic for this one is based on maximizing the step that can be taken during the joint optimization. Evaluating the kernel function for doing so would be time-consuming, so SMO approximates the step size using equation (9.8): the maximum possible step is taken for the largest value of $|E_1 - E_2|$. To speed this up, a cached error value E is kept for every non-bound example, from which SMO chooses the one that approximately maximizes the step size. If $E_1$ is positive, the example with minimum error $E_2$ is chosen; if $E_1$ is negative, the example with maximum error $E_2$ is chosen.
Under unusual circumstances, such as the ones remarked at the end of the last sub-chapter (two identical training vectors), SMO cannot make positive progress using this second choice heuristic. To avoid this, SMO uses a hierarchy of second choice heuristics until it finds a pair of multipliers making positive progress. If there is no positive progress using the above approximation, the algorithm starts iterating through the non-bound examples at a random position. If none of them makes positive progress, the algorithm starts iterating through the entire training set at a random position to find a suitable multiplier $\alpha_2^{old}$ that will make positive progress in the joint optimization. The randomness in choosing the starting position is used to avoid a bias towards examples stored at the beginning of the training set. In very extreme, degenerate cases a second multiplier making positive progress cannot be found at all. In such cases the first multiplier is skipped and a new one is chosen. A sketch of the first level of this hierarchy is given below.
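The first level of the second choice hierarchy can be sketched as follows in C++; the representation of the error cache as two parallel vectors is an assumption made for this illustration only, and the random fall-back levels are omitted:

#include <cmath>
#include <cstddef>
#include <vector>

// Sketch only: among the cached errors of the non-bound examples, pick
// the index that maximizes |E1 - E2|, which approximately maximizes the
// step size of (9.8).
std::size_t choose_second(double E1,
                          const std::vector<double>& error_cache,
                          const std::vector<bool>& non_bound)
{
    std::size_t best = 0;
    double best_gap = -1.0;
    for (std::size_t i = 0; i < error_cache.size(); ++i) {
        if (!non_bound[i]) continue;          // only non-bound examples
        const double gap = std::fabs(E1 - error_cache[i]);
        if (gap > best_gap) { best_gap = gap; best = i; }
    }
    return best;                              // fall-back levels omitted
}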
9.2.3 Updating the threshold b and the Error Cache
Since solving for the two Lagrange multipliers does not determine the threshold b of the SVM, and since the value of the error cache E needs to be updated at the end of each optimization step, the value of b has to be re-evaluated after each optimization. So b is re-computed after each step such that the KT conditions are fulfilled for both optimized examples.
Now let $u_1$ be the output of the SVM with the old $\alpha_1$ and $\alpha_2$:

$u_1 = \alpha_1^{old} y_1 K_{11} + \alpha_2^{old} y_2 K_{12} + \sum_{i=3}^{l} \alpha_i y_i K_{1i} - b^{old}$   (9.11)

$u_1 = E_1 + y_1$   (9.12)

As in figure 9.4, if the new $\alpha_1$ is not at the bounds, the output of the SVM after optimization on example 1 will be $y_1$, its label value, and therefore:

$y_1 = \alpha_1^{new} y_1 K_{11} + \alpha_2^{new,clipped} y_2 K_{12} + \sum_{i=3}^{l} \alpha_i y_i K_{1i} - b_1$   (9.13)

Substituting (9.13) and (9.11) into (9.12):

$b_1 = E_1 + b^{old} + y_1(\alpha_1^{new} - \alpha_1^{old})K_{11} + y_2(\alpha_2^{new,clipped} - \alpha_2^{old})K_{12}$   (9.14)

Similarly an equation for $b_2$ is obtained, such that the output of the SVM after optimization is $y_2$ when $\alpha_2$ is not at the bounds:

$b_2 = E_2 + b^{old} + y_1(\alpha_1^{new} - \alpha_1^{old})K_{12} + y_2(\alpha_2^{new,clipped} - \alpha_2^{old})K_{22}$   (9.15)
When both $b_1$ and $b_2$ are valid, they are equal (see figure 9.4 again). When both newly calculated Lagrange multipliers are at the bound and L is not equal to H, the interval $[b_1, b_2]$ describes all thresholds consistent with the KT conditions; SMO then chooses $b^{new} = \frac{b_1 + b_2}{2}$. This formula is only valid if b is subtracted from the weighted sum of the kernels, not added. If one multiplier is at the bound and the other one is not, the value of b calculated using the non-bound multiplier is used as the new updated threshold. As mentioned above, this step is regarded as problematic by [Ker01]. But to avoid it, the original SMO algorithm discussed here has to be modified as a whole, and therefore only a reference to the improved algorithm is given here. The modified pseudo code will be stated together with the original one in the appendix.
As seen in the former chapter, a cached error value E is kept for every example whose Lagrange multiplier is neither zero nor C (non-bound). So if a Lagrange multiplier is non-bound after being optimized, its cached error is zero (it is classified correctly). Whenever a joint optimization occurs, the stored errors of the other multipliers not involved have to be updated using the following equation:

$E_i^{new} = E_i^{old} + u_i^{new} - u_i^{old}$

Resubstituted, this becomes:

$E_i^{new} = E_i^{old} + y_1(\alpha_1^{new} - \alpha_1^{old})K_{1i} + y_2(\alpha_2^{new,clipped} - \alpha_2^{old})K_{2i} + b^{old} - b^{new}$   (9.16)
Figure 9.4: Threshold b when both $\alpha$'s are at the bound (= C). The support vectors A and B give the same threshold b, that is, the distance of the optimal separating hyperplane from the origin. Points D and E give $b_1$ and $b_2$ respectively; they are error points within the margin. The searched b lies somewhere between $b_1$ and $b_2$.
Overall, when an error value E is required by the SMO algorithm, it will look it up in the error cache if the corresponding Lagrange multiplier is not at a bound. Otherwise, it will evaluate the current SVM decision function based on the current $\alpha$'s (classify the given point and compare the result to the given label). A sketch of the threshold and error-cache update follows.
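The update rules (9.14), (9.15) and (9.16) can be sketched in C++ as below; the helper names and the packaging of the deltas $y_1(\alpha_1^{new} - \alpha_1^{old})$ and $y_2(\alpha_2^{new,clipped} - \alpha_2^{old})$ into the two arguments da1 and da2 are assumptions made for this illustration:

// Sketch only: threshold update after one SMO step, as in (9.14)/(9.15).
// The returned b is the non-bound value if only one multiplier is
// non-bound, otherwise the midpoint of b1 and b2, as described above.
double update_threshold(double E1, double E2, double b_old,
                        double da1, double da2,
                        double K11, double K12, double K22,
                        bool a1_non_bound, bool a2_non_bound)
{
    const double b1 = E1 + b_old + da1 * K11 + da2 * K12;   // (9.14)
    const double b2 = E2 + b_old + da1 * K12 + da2 * K22;   // (9.15)
    if (a1_non_bound && !a2_non_bound) return b1;
    if (a2_non_bound && !a1_non_bound) return b2;
    return 0.5 * (b1 + b2);
}

// Error-cache update (9.16) for an example i not involved in the step.
double update_error(double Ei_old, double da1, double da2,
                    double K1i, double K2i, double b_old, double b_new)
{
    return Ei_old + da1 * K1i + da2 * K2i + b_old - b_new;
}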
9.2.4 Speeding up SMO
There are certain points in the SMO algorithm where some useful techniques can be applied to speed up the calculation. As said in the summary on linear SVMs, it is possible there to store the weight vector directly, rather than all of the training examples that correspond to non-zero Lagrange multipliers. This optimization is only possible for the linear kernel. After the joint optimization has succeeded, the stored weight vector must be updated to reflect the newly found Lagrange multipliers. This update is easy, due to the linearity of the SVM:

$\mathbf{w}^{new} = \mathbf{w}^{old} + y_1(\alpha_1^{new} - \alpha_1^{old})\mathbf{x}_1 + y_2(\alpha_2^{new,clipped} - \alpha_2^{old})\mathbf{x}_2$
This is a speed-up because much of the computation time in SMO is spent evaluating the decision function; therefore, speeding up the decision function speeds up SMO. Another optimization that can be made is using the sparseness of the input vectors. Normally an input vector is stored as a vector of floating point numbers. A sparse input vector (with zeros in it) is stored by means of two arrays: id and val. The id array is an integer array storing the locations of the non-zero inputs, while the val array is a floating point array storing the corresponding non-zero values. The frequently needed dot product between two vectors stored this way, (id1, val1, length=num1) and (id2, val2, length=num2), can then be computed quickly, as shown in the code below:
// Sparse dot product of (id1, val1, num1) and (id2, val2, num2).
// Both id arrays are assumed to be sorted in ascending order.
double sparse_dot(const int id1[], const double val1[], int num1,
                  const int id2[], const double val2[], int num2)
{
    int p1 = 0, p2 = 0;
    double dot = 0.0;
    while (p1 < num1 && p2 < num2) {
        if (id1[p1] == id2[p2]) {         // matching non-zero positions
            dot += val1[p1] * val2[p2];
            p1++; p2++;
        } else if (id1[p1] > id2[p2]) {   // advance the smaller index
            p2++;
        } else {
            p1++;
        }
    }
    return dot;
}
This can be used to calculate linear and polynomial kernels directly. Gaussian kernels can also use this optimization through the following identity:

$\|\mathbf{x} - \mathbf{y}\|^2 = \mathbf{x}\cdot\mathbf{x} - 2\,\mathbf{x}\cdot\mathbf{y} + \mathbf{y}\cdot\mathbf{y}$

To speed up the Gaussian case even more, the dot product of every input with itself can be pre-computed.
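Based on this identity and the sparse dot product above, a Gaussian kernel evaluation can be sketched as follows; the normalization $\exp(-\|\mathbf{x}-\mathbf{y}\|^2 / (2\sigma^2))$ is assumed here to match the sigma ("window width") parameter used in chapter 6.2.3:

#include <cmath>

// Sketch only: Gaussian kernel value from sparse dot products.
// xx and yy are the pre-computed self dot products x.x and y.y,
// xy is sparse_dot(...) from the routine above.
double rbf_kernel(double xx, double yy, double xy, double sigma)
{
    const double dist2 = xx - 2.0 * xy + yy;          // ||x - y||^2
    return std::exp(-dist2 / (2.0 * sigma * sigma));
}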
Another optimization technique for linear SVMs again concerns the weight vector. Because it is not stored as a sparse array, the dot product of the weight vector with a sparse input vector (id, val) can be expressed as:

$\sum_{i=0}^{num-1} w[id[i]] \cdot val[i]$

And for binary inputs, storing the array val is not even necessary, since its entries are always 1. The dot product calculation in the code above then becomes a simple increment, and for a linear SVM the dot product of the weight vector with a sparse input vector becomes:

$\sum_{i=0}^{num-1} w[id[i]]$
As mentioned in Platt's paper, there are more speed-up techniques that can be used, but they will not be discussed in detail here.
9.2.5 The improved SMO algorithm by Keerthi
In his paper [Ker01] Keerthi points out some difficulties in the original SMO algorithm caused by explicitly using the threshold b when checking the KT conditions. His modified algorithm will be stated here as pseudo code with a little explanation; for further details please refer to Keerthi's paper.
Keerthi uses some new notation. Define $F_i = \mathbf{w} \cdot \mathbf{x}_i - y_i$. Now the KT conditions can be expressed as:

$\alpha_i = 0 \Rightarrow y_i(F_i - b) \ge 0$
$0 < \alpha_i < C \Rightarrow y_i(F_i - b) = 0$
$\alpha_i = C \Rightarrow y_i(F_i - b) \le 0$

and these can be written as:

$i \in I_0 \cup I_1 \cup I_2 \Rightarrow F_i \ge b$
$i \in I_0 \cup I_3 \cup I_4 \Rightarrow F_i \le b$

where

$I_0 = \{i : 0 < \alpha_i < C\}$
$I_1 = \{i : y_i = +1, \alpha_i = 0\}$
$I_2 = \{i : y_i = -1, \alpha_i = C\}$
$I_3 = \{i : y_i = +1, \alpha_i = C\}$
$I_4 = \{i : y_i = -1, \alpha_i = 0\}$

And now, to check whether the KT conditions hold, Keerthi also defines:

$b_{up} = \min\{F_i : i \in I_0 \cup I_1 \cup I_2\} = F_{i\_up}$   (A)
$b_{low} = \max\{F_i : i \in I_0 \cup I_3 \cup I_4\} = F_{i\_low}$   (B)

((A) and (B) are links to the pseudo code in the appendix.)

The KT conditions then imply $b_{up} \ge b_{low}$ and similarly $\forall i \in I_0 \cup I_1 \cup I_2: F_i \ge b_{low}$ and $\forall i \in I_0 \cup I_3 \cup I_4: F_i \le b_{up}$.
These comparisons do not use the threshold b! As an added benefit, given the first $\alpha_1^{old}$, these comparisons automatically find the second multiplier for the joint optimization.
The pseudo code, as it can be found in Keerthi's paper, is given in appendix D.
As seen in the pseudo code and in Keerthi's paper, there are two modifications of the SMO algorithm. Both were tested in the paper on different datasets and showed a significant speed-up in contrast to the original SMO algorithm by Platt. They also overcome the problems arising when only using a single threshold (an example of why such problems arise can also be found in Keerthi's paper). As a conclusion of all tests, Keerthi showed that the second modification fares better overall. A small sketch of the computation of $b_{up}$ and $b_{low}$ is given below.
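The computation of $b_{up}$ and $b_{low}$ over the index sets can be sketched as follows in C++; the data layout (parallel vectors for F, y and alpha) and the exact floating point comparisons with 0 and C are simplifications made for this illustration:

#include <cstddef>
#include <limits>
#include <vector>

// Sketch only: b_up and b_low from the index sets (A) and (B).
// F[i] = w.x_i - y_i; y[i] in {-1,+1}; alpha[i] the multipliers.
// Optimality within a tolerance then means b_low <= b_up (approximately).
void compute_b_up_low(const std::vector<double>& F,
                      const std::vector<int>& y,
                      const std::vector<double>& alpha, double C,
                      double& b_up, double& b_low)
{
    b_up  =  std::numeric_limits<double>::infinity();
    b_low = -std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < F.size(); ++i) {
        const bool in_I0 = alpha[i] > 0.0 && alpha[i] < C;
        const bool in_I1 = y[i] == +1 && alpha[i] == 0.0;
        const bool in_I2 = y[i] == -1 && alpha[i] == C;
        const bool in_I3 = y[i] == +1 && alpha[i] == C;
        const bool in_I4 = y[i] == -1 && alpha[i] == 0.0;
        if ((in_I0 || in_I1 || in_I2) && F[i] < b_up)  b_up  = F[i];
        if ((in_I0 || in_I3 || in_I4) && F[i] > b_low) b_low = F[i];
    }
}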
9.2.6 SMO and the 2-norm case
As stated before, the SMO algorithm is not able to handle the 2-norm case without altering the code. Recall that there are two differences to the maximal margin and the 1-norm case: first the addition of 1/C to the diagonal of the kernel matrix, and second the altered KT conditions, which are used in SMO as the stopping criterion:

$\alpha_i = 0 \Rightarrow y_i(\mathbf{w}^* \cdot \mathbf{x}_i + b^*) \ge 1$
$0 < \alpha_i \Rightarrow y_i(\mathbf{w}^* \cdot \mathbf{x}_i + b^*) = 1 - \frac{\alpha_i}{C}$
As the original SMO algorithm tests the KT conditions only in the outer loop, when selecting the first multiplier to optimize, this is the point to alter. The kernel evaluation also has to be extended to add the diagonal values. In the pseudo code, the checking of the KT conditions is processed by:
E2 = SVM output on point[i2] - y2 (check in error cache)
r2 = E2*y2
if ((r2 < -tol && alph2 < C) || (r2 > tol && alph2 > 0))

where r2 is the same as $y_i f(\mathbf{x}_i) - 1$. So the KT conditions are tested against $\ge 0$ and $\le 0$, where 0 is replaced by the tolerance "tol". For the 2-norm case, where the multipliers are not upper bounded, the test is rewritten as:

E2 = SVM output on point[i2] - y2 (check in error cache)
r2 = E2*y2 + alph2/C
if ((r2 < -tol) || (r2 > tol && alph2 > 0))
Second, as in the maximal margin case, the box constraint on the multipliers has to be removed, because they are no longer upper bounded by C. And last but not least, the bias has to be calculated using only alphas fulfilling the equation $0 < \alpha_i \Rightarrow y_i(\mathbf{w}^* \cdot \mathbf{x}_i + b^*) = 1 - \frac{\alpha_i}{C}$. A sketch of the changed kernel evaluation is given below.
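The change to the kernel evaluation itself is minimal, as this C++ sketch shows; the function name and the idea of passing the pre-computed base kernel value are assumptions made for this illustration:

#include <cstddef>

// Sketch only: kernel evaluation for the 2-norm soft margin. The only
// change to the kernel is the extra 1/C on the diagonal of the kernel
// matrix; base_kernel can be any of the kernels from chapter 6.
double kernel_2norm(std::size_t i, std::size_t j,
                    double base_kernel, double C)
{
    return (i == j) ? base_kernel + 1.0 / C : base_kernel;
}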
9.3 Data Pre-processing
As one can read in [Lin03], there are some propositions on the handling of the data used.
9.3.1 Categorical Features
SVMs require that each data instance is represented as a vector of real numbers. Hence, if there are categorical attributes, they first have to be converted into numeric data. Cheng recommends using m numbers to represent an m-category attribute, where only one of the m numbers is one and the others are zero. Consider the three-category attribute {red, green, blue}, which can then be represented as (0,0,1), (0,1,0) and (1,0,0). Cheng's experience indicates that if the number of values of an attribute is not too large, this coding may be more stable than using a single number to represent a categorical attribute. A sketch of this coding is given below.
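This one-of-m coding can be sketched in a few lines of C++; the 0-based category index is an assumption of this illustration:

#include <cstddef>
#include <vector>

// One-of-m coding for a categorical attribute: category k of m becomes
// a vector of m numbers with a single one, e.g. for {red, green, blue}
// the codes (1,0,0), (0,1,0) and (0,0,1).
std::vector<double> one_hot(std::size_t category, std::size_t m)
{
    std::vector<double> code(m, 0.0);
    code[category] = 1.0;            // category is 0-based here
    return code;
}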
9.3.2 Scaling
Scaling the data before applying it to an SVM is very important. [Lin03] explains why scaling is so important, and most of these considerations apply to SVMs as well.
The main advantage is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation: because kernel values usually depend on the inner products of feature vectors, large attribute values may cause numerical problems. So Cheng recommends linearly scaling each attribute to the range [-1, +1] or [0, 1]. In the same way, the test data then has to be scaled before testing it on the trained machine.
In this diploma thesis the most commonly used scaling to [-1, +1] is applied. The components of an input $\mathbf{x} := (x_1, \ldots, x_n)^T$ are linearly scaled to the interval [-1, +1] of length l = 2 by applying:

$x_{i,scal} = l \cdot \frac{x_i - x_{i,min}}{x_{i,max} - x_{i,min}} - \frac{l}{2} = 2 \cdot \frac{x_i - x_{i,min}}{x_{i,max} - x_{i,min}} - 1$

with $i \in \{1, 2, \ldots, n\}$. The scaling has to be done for each feature separately, so the min and max values are taken over the current feature in all vectors. In detail, the reason for doing this is as follows:
Imagine a vector of two features (2-dimensional), the first with a value of 5, the second of 5000, and assume the other vectors behave the same way. Then the first feature would not have a very great impact on distinguishing between the classes, because the changes in feature one are numerically very small in contrast to those of feature two, whose values lie in a much higher range.
Other long-studied methods for scaling the data, which show very good results, use the covariance matrix from Gaussian theory. A sketch of the simple min-max scaling used here follows.
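A minimal C++ sketch of this per-feature scaling (an illustration only; the thesis itself uses the Matlab function linscale listed in chapter 10):

#include <cstddef>
#include <vector>

// Linear scaling of each feature to [-1, +1] as in the formula above.
// data[k][i] is feature i of training vector k; min/max are taken per
// feature. The same min/max would have to be kept and applied to the
// test data, as demanded above.
void scale_features(std::vector<std::vector<double>>& data)
{
    if (data.empty()) return;
    const std::size_t n = data[0].size();
    for (std::size_t i = 0; i < n; ++i) {
        double lo = data[0][i], hi = data[0][i];
        for (const auto& x : data) {
            if (x[i] < lo) lo = x[i];
            if (x[i] > hi) hi = x[i];
        }
        if (hi == lo) continue;          // constant feature: leave as is
        for (auto& x : data)
            x[i] = 2.0 * (x[i] - lo) / (hi - lo) - 1.0;
    }
}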
9.4 Matlab Implementation and Examples
This chapter is intended to show some examples and to give an impression of how the different tunable values, such as the penalty parameter C, the kernel parameters, and the choice of maximal margin, 1-norm or 2-norm, affect the resulting classifier.
The implementation in Matlab with the original SMO algorithm, together with the training sets, can be found under Matlab Files\SVM\ (a complete list with the usage and a short description of each file will be given in chapter 10); these files were used to produce the following pictures.
It should be mentioned that the SMO implementation in pure Matlab is rather slow. Therefore nearly every SVM toolbox written in Matlab implements the SMO algorithm in C code and calls it in Matlab through the so-called "Mex functions" (the C/Matlab interface). But for examining the small examples used here, the use of pure Matlab is acceptable. Later the whole code for Support Vector Machines will be implemented in C++ anyway, to be integrated into the "Neural Network Tool" already existing at Siemens VDO.
For all upcoming visualisations the dimension of the training and test vectors is restricted to the two-dimensional case, because only such examples can be visualized in two and three dimensions and discussed. The three-dimensional pictures show on the z-axis the values calculated by the learned decision function, without the classification by the signum function "sgn" applied to them. The boundary regions are also shaded according to the functional margin of each point; in other words, the darker the shading, the more strongly the point belongs to that specific class. The pictures will clarify this.
9.4.1 Linear Kernel
As examples using the linear "kernel", the linearly separable cases of the binary functions OR and AND are considered (figures 9.5 and 9.6). The dashed lines represent the margin; the size of the functional margin is indicated by the level of shading.
A test of the same machine on the XOR case results in a classification with one error, because by its nature the XOR function is not separable in input space (figure 9.7).
Figure 9.5: A linear kernel with maximal margin (C = inf) applied to the linearly separable case of the binary OR function.

Figure 9.6: A linear kernel with maximal margin (C = inf) applied to the linearly separable case of the binary AND function.

Figure 9.7: A linear kernel with soft margin (C = 1000) applied to the linearly non-separable case of the XOR function. The error is 25 %, as one of the four points is misclassified.
9.4.2 Polynomial Kernel
As seen before, the XOR case is non-separable in input space. Therefore the usage of a kernel, mapping the data to a higher-dimensional space and separating it there linearly, can produce a classifier in input space that separates the data correctly. To test this, a polynomial kernel of degree two with maximal margin (C = inf) is used. The result can be seen in figure 9.8.
To get an impression of how this data becomes separable by mapping it to a higher dimensional space, the three-dimensional picture in figure 9.9 visualizes on the z-axis the output of the classification step before the signum (sgn) function is applied.
Figure 9.8: A polynomial kernel of degree 2 with maximal margin (C = inf) applied to the
XOR dataset.
Figure 9.9: The classifier of figure 9.8, visualized by showing the calculated value of the classification on the z-axis before the application of the signum (sgn) function.

Here one can see that the yellow regions belonging to one of the classes have larger positive values, and the green region belonging to the other class has values lower than zero. The separation changes from one class to the other at the zero level of the classifier output (z-axis), as the signum function changes sign there.
The main conclusion to be drawn from the pictures so far, and from further ones, is that the application of a kernel measures the similarity between the data in some way: regarding the last two figures again, one can see that the points belonging to the same class are mapped in the same "direction" (output values >= 0 or < 0). The upcoming pictures of the Gaussian kernel will stress this fact.
9.4.3 Gaussian Kernel (RBF)
As stated in the chapter on kernels, if one has no idea of how the data is distributed, the Gaussian kernel, in other words the radial basis function, is a good first choice.
Of course, in the XOR case applying this kernel is like using a sledgehammer to crack a nut, but the resulting pictures nevertheless stress the fact that a kernel measures the similarity of the data in some way (the resulting value before applying the signum function). Another point is that here the effect of changing the sigma value (variance, "window width", see 6.2.3) can be seen quite clearly.
Figure 9.10: The RBF kernel applied to the XOR data set with $\sigma = 0.1$ and maximal margin (C = inf).
To see how the change of the sigma value (variance) affects the resulting classifier, compare figures 9.10 and 9.11 to figures 9.12 and 9.13. Notice the smoother and wider course of the curves at the given training points.
Figure 9.11: The classifier of figure 9.10 ($\sigma = 0.1$), visualized by showing the calculated value of the classification on the z-axis before the application of the signum (sgn) function. Remarkable are the "Gauss curves" at the positions of the four given training points (there the classifier is more confident that a point in that region belongs to the specific class).
Figure 9.12: The RBF kernel applied to the XOR data set with $\sigma = 0.5$ and maximal margin (C = inf).
Figure 9.13: The classifier of figure 9.12 ($\sigma = 0.5$), visualized by showing the calculated value of the classification on the z-axis before the application of the signum (sgn) function. Remarkable are the "Gauss curves" at the positions of the four given training points. In contrast to figure 9.11 with a sigma value of $\sigma = 0.1$, they are much smoother and "wider", as sigma changes the "width" (consider the effect of the variance in the Gaussian distribution).
To give an impression of how different values of the penalty parameter C (the soft margin case for 0 < C < inf) affect the resulting classifier, the next pictures illustrate the application of C.
As a starting point, consider the classification problem of figure 9.14, classified by an SVM with a Gaussian kernel using $\sigma = 0.2$ and the maximal margin concept, allowing no training errors. The resulting classification regions are not very smooth, due to the two training points lying in the midst of the other class. Applying the same machine to the dataset, but with the soft margin approach, by setting the upper bound C to five, results in the classifier of figure 9.15.
Here the whole decision boundary is much smoother than in the maximal margin case. The main "advantage" is the broader margin, implying a better generalization. This fact is also stressed by figure 9.16 and the next sub-chapter.
Figure 9.14: A Gaussian kernel with $\sigma = 0.2$ and maximal margin (C = inf). The dashed margins are not really "wide", because of the two points lying in the midst of the other class and the application of the maximal margin classifier (no errors allowed).
Figure 9.15: A Gaussian kernel with $\sigma = 0.2$ and soft margin (C = 5). This approach gives smoother decision boundaries in contrast to the classifier in figure 9.14, but at the expense of misclassifying two points now.
9.4.4 The Impact of the Penalty Parameter C on the Resulting Classifier and the Margin
Now the change of the resulting classifier (boundary, margins) when applying the maximal margin and the soft margin approach will be analyzed in detail.
Consider the training set used in figure 9.16. The SVM used there is based on a Gaussian kernel applying the concept of the maximal margin approach, allowing no training error (C = inf). As one can see, the resulting classifier does not have a very broad margin; therefore, as stated in the theory on generalization in part one of this diploma thesis, this classifier is assumed not to generalize very well.
In contrast to this, the approaches in figures 9.17 to 9.19 use the soft margin optimization and result in a broader margin, but at the expense of allowing training errors. Such "errors" can also be interpreted as the classifier not overestimating the influence of some "outliers" in the training set (because of such outliers, the "hill" in figure 9.16 lies in the midst of where one would imagine the other class to be).
Figure 9.16: An SVM with a Gaussian kernel with $\sigma = 0.8$ and maximal margin (C = inf). The resulting classifier is compatible with the training set without error, but has no broad margin.
So these classifiers are assumed to generalize better in this case, which is the goal of a classifier: it must generalize very well while minimizing the classification error.
As stated in chapter two, another very general estimate of the generalization error of SVMs is the fraction of support vectors gained after training:

$\frac{\#SV}{l}$

So a small number of support vectors is expected to give better generalization. Another advantage in practice is that the fewer support vectors there are, the less expensive is the computation of the classification of a point.
To summarize, as the theory on generalization stated, a broad margin and few support vectors are indications of good generalization. So the application of the soft margin approach can be seen as a compromise between a small empirical risk and a small optimism.
Figure 9.17: An SVM with a Gaussian kernel with $\sigma = 0.8$ and soft margin (C = 100). Notice the broader margin in contrast to figure 9.16. The boundary has become smoother, and the misclassified points (three "real" errors, four counting one margin error) do not have as much impact on the boundary as in figure 9.16.

Figure 9.18: An SVM with a Gaussian kernel with $\sigma = 0.8$ and soft margin (C = 10). Notice the broader margin in contrast to figures 9.16 and 9.17. The boundary is much smoother.
Figure 9.19: An SVM with a Gaussian kernel with $\sigma = 0.8$ and soft margin (C = 1). Notice the broader margin in contrast to figures 9.16, 9.17 and 9.18, and the much smoother boundary.
Part IV
Manuals, Available Toolboxes and Summary
Chapter 10
Manual
As said at the beginning, one of the goals was to implement the theory in a computer program for practical usage. This program was first developed in Matlab Release 12, for better debugging and for producing the demanding graphical output. All figures from the last chapter were made with this implementation, and after reading this chapter you should be able to use the created files as well. After testing the whole theory there extensively, the code was ported to C++ as a module to be integrated into the already existing "Neural Network Tool".
10.1 Matlab Implementation
The Matlab approach was used first because of the better debugging possibilities for the algorithm. The development was also faster here because of the mathematical nature of the problem. But the main advantage was the graphical output readily available in Matlab.

Figure 10.1: The disk structure for all files associated with the Matlab implementation

The next table summarizes all files created for the Matlab implementation. An example of their usage is given after it.
Path: Classifier
  kernel_func: Internally used for kernel calculation.
  kernel_Eval: Evaluates the chosen kernel function for the given data.

Path: Binary Case (files associated with the 2-class problem)
  SMO: Implementation of the original SMO algorithm. (Multiple return values)
  SMO_Keerthi: Implementation of the improved SMO algorithm by Keerthi with modification 2. (Multiple return values)
  classify_Point: Classification of an unlabeled data example after training. (Returns the value without the signum function (sgn) applied)

Path: Multiclass
  Multiclass_SMO: The above SMO for the multiclass case. (Multiple return values)
  Multiclass_SMO_Keerthi: The above improved SMO for the multiclass case. (Multiple return values)
  Multi_Classify_Point: Classifies a point after training. (A vector containing all votes for each class is returned; tie situations are still possible!)

Path: Testdata
  Contains *.mat files with prelabeled test data for loading into the workspace.

Path: Util
  check2ddata: Internally used by "createdata".
  createdata: Creates 2-dimensional prelabeled training data for the two- and multiclass case in a GUI; saveable to a file. (Up to now only the following calling conventions are supported: createdata for the two-class case; createdata('finite', nrOfClasses) for creating multiclass test data)
  linscale: Scales the data linearly to the interval [-1, +1].
  makeTwoClass: If the data is not stored with the labels -1 and +1 for binary classification, this function rewrites them. (Calling convention: makeTwoClass(data, label_of_one_class); label_of_one_class is mapped to +1 and the remaining ones to -1)

Path: Visual (files for plotting a trained classifier)
  Binary case:
    svcplot2D: For the two-class problem: a two-dimensional plot of the trained classifier. The coloured shaded regions represent the calculated value of the classification for each point BEFORE applying the signum (sgn) function: yellow for values >= 0 and therefore class +1, green for values < 0 and therefore class -1; the darker the colour, the greater the value (see the legend). (Only applicable if the data/feature vectors are 2-dimensional! The dashed lines represent the margin.)
    svcplot3D: Same as above, but this three-dimensional plot visualizes the calculated value of the classification BEFORE applying the signum (sgn) function in the third dimension. (The dashed lines represent the margin.)
  Multiclass:
    svcplot2D_Multiclass: Same as the two-class plot above, but for problems of three up to a maximum of seven classes.

Table 10.1: List of files used in the Matlab implementation and their intention
10.2 Matlab Examples
Here are two examples of how to use the Matlab implementation in practice. The first one is for the two-class problem, and the other one shows how to train a multiclass classifier.
First call createdata, or load a predefined test data set into the workspace. If using the createdata function, the screen looks like figure 10.2 after generating some points by left-clicking with the mouse. You can erase points by right-clicking on them and adjust the range of the axes by entering the wanted values on the right. The class can be switched with the combo box in the upper right corner. When ready, click Save and choose a filename and location for saving the newly generated data. Close the window and load the file into the Matlab workspace. You should then see a vector X containing the feature data and a vector y containing the labels +1 and -1 there.
Figure 10.2: The GUI of the createdata function after the creation of some points for
two-class classification by left-clicking on the screen.
Before training you have to specify a kernel to use. In this implementation that is done by creating a field as follows:

myKernel.name = 'text'
optional: myKernel.param1 = value_1
optional: myKernel.param2 = value_2

Values for text (the kernel used):

- linear
- poly
- rbf

value_1: Not used for the linear kernel; for the polynomial one it is the dimension (degree), and for the RBF/Gaussian kernel it is the value of sigma (the window width).
value_2: Only used for the polynomial kernel, where it is the constant c that is added.

If none of the last two parameters is given, default values are used. There should then be a new variable in the workspace called myKernel.
In this example we use:

myKernel.name = 'poly'
myKernel.param1 = 2
Now we are ready for training. For this there are two functions available with the same calling convention:

- SMO
- SMO_Keerthi

As the names imply, the first one implements the original SMO algorithm and the second one the improved algorithm by Keerthi with modification 2. In any case the second one should always be used, because, as stated in the former part, the original SMO is very slow and can run infinitely if you choose to separate the data by means of the hard margin although it is not separable without errors. To train the classifier simply call:
[alphas bias nsv trainerror] = SMO_Keerthi(X, y, upper_bound_C, eps, tol, myKernel)

X is the training set
y are the labels (+1, -1)
upper_bound_C is either inf for the hard margin case or any value > 0 for the soft margin one (here: inf)
eps is the accuracy, normally set to 0.001
tol is the tolerance for checking the KKT conditions, normally 0.001
myKernel is the field created above

Returned values are:

alphas is the array containing the calculated Lagrange multipliers
bias is the calculated bias
nsv is the number of support vectors (alpha > 0)
trainerror is the error rate in % on the training set

If using the original function SMO(…), another parameter is needed after the myKernel variable: 2-norm, which is zero for using the hard margin or the 1-norm soft margin, and one for using the 2-norm.
After pressing Return the training process starts, and you get an overview as in figure 10.3 after the training has ended.
If the above calling convention is used, you get two newly created variables in the workspace for further usage: alphas and bias.
Now the result can be visualized using the functions svcplot2D and/or svcplot3D, as can be seen in figures 10.4 and 10.5. They have the same calling convention:
svcplot2D(X, y, myKernel, alphas, bias)
where X is again the training data as before, and likewise the labels y and myKernel; alphas and bias are the variables gained through the training process.
Figure 10.3: After training you get the results: values of alphas, the bias, the training
error and the number of support vectors (nsv)
Figure 10.4: svcplot2D after the training of a polynomial kernel with degree two on the
training set created as in figure 10.2.
Figure 10.5: svcplot3D after the training of a polynomial kernel with degree two on the
training set created in figure 10.2.
The second example consists of four classes, to show how multiclass classification works here. Again create a training set with createdata, but now by calling (see figure 10.6):

createdata('finite', 4)
Figure 10.6: GUI for creating a four-class problem
In this example we use a linear kernel: myKernel.name = 'linear'
After again loading the created data into the workspace, obtaining the variables X and y, we are ready for training:

[alphas bias nsv trainerror overall_error] = Multiclass_SMO_Keerthi(X, y, upper_bound_C, eps, tol, myKernel)

The only difference to the binary case is the additional return value of the overall error: trainerror is the error rate of each classifier trained during the process (there are multiple ones, because the OVO method is used).
After the training you again get the results as in figure 10.3. Now the trained classifier(s) can be plotted:

svcplot2D_Multiclass(X, mykernel, alpha, bias, nr_of_classes)

where X is the training data, nr_of_classes is the number of classes used in training (here: 4), and the other parameters are the same as in the binary case above. This plot takes a little more time to show up, but in the end it looks like the one in figure 10.7.
Figure 10.7: svcplot2D_Multiclass called after the training of the four classes as created
in figure 10.6.
10.3 The C++ Implementation for the Neural Network Tool
The main goal of this work was the integration of the SVM module into the already existing "Neural Network Tool" created by Siemens VDO. The application GUI is shown in figure 10.8 with a test of the SVM. The tool already contained two integrated classifiers, the polynomial one and the Radial Basis Function classifier, and was capable of:

- Training multiple instances of a classifier on separate or the same training set(s)
- Visualizing data of two dimensions and the trained classifier
- Storing the results and the parameters of a classifier, for loading an already trained classifier and testing it on another data set
Figure 10.8: The Neural Network Tool with integrated SVM. Here an overlapping training
set was trained with the Gaussian/RBF kernel with no error.
The integration of the new module was "easy" because the system was already open for the integration of new classification techniques. So the to-dos were the following:

- Programming of a control GUI for the SVM
- Programming of the algorithms themselves
- Store and load procedures for the relevant parameters, to load a trained classifier at a later time
- Store procedures for the results of a training run

As the algorithms used had been tested extensively in Matlab, they did not need any further debugging here. As a benefit of the time saved, some additions were made that are not implemented in Matlab. For example, one is now able to do a grid search for the upper bound C as described in chapter 7.2, but without cross-validation. The algorithms implemented here are the original SMO with 1- and 2-norm capabilities and the improved SMO by Keerthi with modification two.
The program was split into a few modules, which can be seen in figure 10.9.
Figure 10.9: The UML diagram for the integration of the SVM module
In figure 10.10 the main control dialog for configuring all relevant settings for SVM training can be seen.
At the top you see the actual file loaded for training or testing. Below, on the left, you can select the kernel to use (without prior knowledge one should start with the Gaussian/RBF one, the default) and the algorithm of choice. Keerthi should always be selected, because of its big advantages described in the chapters beforehand. On the right-hand side all other important variables are accessible, such as the upper bound C (a checkbox for the hard margin case; if deselected, you can enter an upper bound > 0 by hand), the kernel parameters (polynomial degree/constant or sigma for the Gaussian/RBF kernel), the accuracy of the calculation, and the tolerance for checking the KT conditions (the default values here are 0.001).
Figure 10.10: Main control interface for configuring the important parameters for SVM
training
Remember that if you select the SMO 2-norm algorithm, no hard-margin classification is possible, so that option is then not selectable. The input for the polynomial degree and sigma is shared in one edit box, indicated by the text next to it, which is switched appropriately. In the lower half you can check the box next to Upper Bound C to do a grid search over predefined values of C. This simply trains 12 classifiers with different values for the upper bound C (currently these are 2^-5, 2^-3, ..., 2^15 and infinity for the hard-margin case; such exponentially growing values were recommended by Lin, as seen in chapter 7.2) and shows the results in a dialog after training (see figure 10.11). One can then select the best parameters for training the classifier.
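In Matlab terms, the C grid the dialog iterates over could be generated as sketched below; SMO_Keerthi is a placeholder name for whichever binary training routine is configured:

% Sketch of the C grid used by the dialog: 2^-5, 2^-3, ..., 2^15 plus
% infinity for the hard-margin case (12 values in total).
C_grid = [2.^(-5:2:15), Inf];
results = zeros(numel(C_grid), 2);        % columns: NSV, training error
for k = 1:numel(C_grid)
    [alphas, bias, nsv, trainerror] = ...
        SMO_Keerthi(X, y, C_grid(k), eps, tol, myKernel);  % placeholder name
    results(k, :) = [nsv, trainerror];
end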
Figure 10.11: The results of a grid search for the upper bound C. From left to right it displays: the number of support vectors (NSV), the kernel parameters (unused yet), the used upper bound C and the training error in %. So one can easily see how the training develops for different values of C. Remarkable here is the fast decrease of the NSV with increasing C. As stated in chapter 9, most of the time the fewer support vectors there are, the better the generalisation will be. All in all this search helps a lot in finding the optimal value for the upper bound. With the later implementation of a grid search for the kernel parameters as well, this will be a powerful tool to find the parameters best suited to the problem at hand.
Last but not least, the Stop Learning button lets one interrupt the training process, and at the bottom of the dialog a progress bar gives visual feedback of the learning or testing progress.
As the Neural Network Tool is the property of Siemens VDO, I am not able to include source files here or on the CD, but all Matlab files are included for testing purposes.
10.4 Available Toolboxes implementing SVM
There are many toolboxes implemented in Matlab, C, C++, Python and many more programming languages available on the internet, free of charge for non-commercial usage. This chapter does not claim to list them all, only a few that were used during the work on this thesis. Some use alternative algorithms for solving the optimization problem arising in SVMs, others are based on modifications of SMO. All toolboxes mentioned here are also available on the CD coming with this work.
A very good page with many resources and links can be found here:
http://www.kernel-machines.org
Most toolboxes are intended for usage under Linux/Unix, but more and more are being ported to the Windows world. Some of those used during the work are listed here:
LibSVM
  An SVM library in C with a graphical GUI. It is the basis for many other toolboxes. The algorithm implemented here is a simplification of SMO, SVMLight and modification 2 of SMO by Keerthi.
  http://www.csie.ntu.edu.tw/~cjlin/libsvm/

SVMLight
  SVM in C with its own algorithm. Also used in other toolboxes such as mySVM. It was tested with superior success in text categorization on the Reuters data set.
  http://svmlight.joachims.org/

Statistical Pattern Recognition Toolbox for Matlab
  A huge toolbox for Matlab from the university of Prague. It implements many algorithms, not only SVM. Very comfortable because of the GUI.
  http://cmp.felk.cvut.cz

mySVM
  A toolbox in C based on the SVMLight algorithm for pattern recognition and regression.
  http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM/

mySVM and SVMLight
  The above toolbox, but written in/for Visual C++ 6.0.
  http://www.cs.ucl.ac.uk/staff/M.Sewell/svm/

OSU SVM
  A Matlab toolbox with the core part written as MEX code for speed, based on LibSVM.
  http://www.eleceng.ohio-state.edu/~maj/osu_svm/

WinSVM
  An easy to use Windows toolbox with GUI.
  http://liama.ia.ac.cn/PersonalPage/lbchen/

Torch
  A machine learning library written in C++ for large scale datasets.
  http://www.torch.ch/

SVMTorch
  SVM for classification and regression on large data sets, based on the Torch library.
  http://www.idiap.ch/index.php?content=SVMTorch&IncFile=PageType&UrlTemplateType=3&cPathContenu=pages/contenuTxt/Projects/Torch/
Table 10.2: Overview of some toolboxes available for Windows/Linux/Unix implemented
in Matlab and C/C++.
10.5 Overall Summary
This work was intended to give an introduction to how Support Vector Machines can be used in the field of pattern recognition. The goal was to let the reader understand why they work at all and how this is achieved mathematically. The mathematical background should be understandable with only minor knowledge in the fields of machine learning and optimization. All mathematical basics important for understanding Support Vector Machines were therefore described in a way that enables a person with a technical background to do further research in this field without reading all the mathematically demanding books and papers concerning Support Vector Machines.
This work was not intended to look into all details of the available algorithms, but to become familiar with the basic ones. Further research could be done especially in the field of multiclass classification, where the mentioned Weston and Watkins (WW) method showed very good results but is rather complicated to use. As this work should be readable by beginners in the field of Support Vector Machines, the text was written in a less formal mathematical language, whereas most or nearly all papers and books assume a very well-founded knowledge of mathematics.
The implemented algorithms, both in Matlab and C++, should verify the theory, and they do, but they can certainly be extended. Fellow researchers could implement other optimization and multiclass algorithms or extend the SVM to the regression case.
LIST OF FIGURES
1.1   Multiple decision functions
2.1   Shattering of points in ℝ²
2.2   Margin on points reducing hypothesis room
3.1   Computer vision
3.2   Development steps of a classifier
3.3   Example with apples and pears
4.1   Convex domains
4.2   Convex and concave functions
4.3   Local minimum
5.1   Vector representation for text
5.2   Separating hyperplane
5.3   Which separation to choose
5.4   Functional margin
5.5   Geometric margin
5.6   Support Vectors
5.7   Slack variables
5.8   Decision boundaries
6.1   Mapping
6.2   Separation of points
6.3   Hyperplane three dimensional
6.4   Geometric solution
6.5   Whole classification procedure
6.6   Polynomial kernel
6.7   2-layer neural network
6.8   Gaussian kernel
8.1   Multiclass: OVR
8.2   Multiclass: OVO
9.1   SMO: The two cases of optimization
9.2   Case 1 in detail
9.3   Case 2 in detail
9.4   Threshold b
9.5   Linear kernel and OR
9.6   Linear kernel and AND
9.7   Linear kernel on XOR
9.8   Polynomial kernel on XOR
9.9   Polynomial kernel on XOR in 3D
9.10  RBF kernel on XOR
9.11  RBF on XOR in 3D
9.12  RBF on XOR II
9.13  RBF on XOR II in 3D
9.14  RBF on overlapping data
9.15  RBF on overlapping data II
9.16  RBF with outliers
9.17  RBF with outliers II
9.18  RBF with outliers III
9.19  RBF with outliers IV
10.1  Disk structure
10.2  GUI of the createdata function for binary case
10.3  Report after training
10.4  Function svcplot2D
10.5  Function svcplot3D
10.6  GUI of the createdata function for multiclass case
10.7  Function svcplot2D for 4 classes
10.8  GUI of the Neural Network Tool
10.9  UML diagram for integrated SVM module
10.10 GUI of main control interface for the SVM module
10.11 Results of a grid search for the upper bound C
LIST OF TABLES
8.1   Multiclass: OVR
8.2   Multiclass: OVO
10.1  List of implemented Matlab files
10.2  List of some available toolboxes
LITERATURE
[Vap79] V. Vapnik. Estimation of Dependences Based on Empirical Data. Nauka, 1979 (English translation: Springer-Verlag, 1982)
[Vap95] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995
[Vap98] V. Vapnik. Statistical Learning Theory. Wiley, 1998
[Bur98] C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2, Kluwer Academic Publishers, 1998
[Jah96] J. Jahn. Introduction to the Theory of Nonlinear Optimization. Springer-Verlag, 1996
[Mar00] K. Marti. Einführung in die lineare und nichtlineare Optimierung. Physica-Verlag, 2000
[Nel00] N. Cristianini, J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000
[Joa98] T. Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. 1998
[Ker01] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, K. R. K. Murthy. Improvements to Platt's SMO Algorithm for SVM Classifier Design. Technical Report CD-99-14, National University of Singapore, 2001
[Pan01] P. Erästö. Support Vector Machines – Backgrounds and Practice. Dissertation, Rolf Nevanlinna Institute, Helsinki, 2001
[Cha00] O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee. Choosing Kernel Parameters for Support Vector Machines. Machine Learning – Special Issue on Support Vector Machines, 2000
[Lin03] C.-C. Chang, C.-J. Lin. A Practical Guide to Support Vector Classification. National Taiwan University. See also: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[Kel03] S. S. Keerthi, C.-J. Lin. Asymptotic Behaviors of Support Vector Machines with Gaussian Kernel. Neural Computation 15(7), 1667-1689, 2003
[Lil03] H.-T. Lin, C.-J. Lin. A Study on Sigmoid Kernels for SVM and the Training of Non-PSD Kernels by SMO-Type Methods. Technical report, National Taiwan University, 2003
[Krs99] U. H.-G. Kressel. Pairwise Classification and Support Vector Machines. In: B. Schölkopf, C. J. C. Burges, A. J. Smola (eds.), Advances in Kernel Methods – Support Vector Learning, MIT Press, Cambridge, 1999
[Stat] A. Statnikov, C. F. Aliferis, I. Tsamardinos. Using Support Vector Machines for Multicategory Cancer Diagnosis Based on Gene Expression Data. Vanderbilt University, Nashville, TN, USA
[Pcs00] J. C. Platt, N. Cristianini, J. Shawe-Taylor. Large Margin DAGs for Multiclass Classification. Advances in Neural Information Processing Systems 12, MIT Press, 2000
[WeW98] J. Weston, C. Watkins. Multi-Class Support Vector Machines. Technical Report CSD-TR-98-04, Royal Holloway, University of London, 1998
[Pla00] J. C. Platt. Fast Training of Support Vector Machines Using Sequential Minimal Optimization. Microsoft Research, Redmond, 2000
STATEMENT
1. I am aware that the diploma thesis, as an examination achievement, becomes the property of the Free State of Bavaria. I hereby declare my consent that the Fachhochschule Regensburg may allow students of the Fachhochschule Regensburg to inspect this examination work, and that it may publish the thesis naming me as its author.
2. I hereby declare that I have written this diploma thesis independently, that I have not previously submitted it for other examination purposes, that I have used no sources or aids other than those indicated, and that I have marked literal and paraphrased quotations as such.
Regensburg, 03.03.2004
……………………………….
Signature
APPENDIX
A SVM - APPLICATION EXAMPLES
A.1 Hand-written Digit Recognition
The first real-world task on which Support Vector Machines were tested was the problem of hand-written character recognition. This is a problem currently used for benchmarking classifiers, originally motivated by the need of the US Postal Service to automate the sorting of mail using the hand-written ZIP codes. Different models of SVM have been tested on the freely available datasets of digits: USPS (United States Postal Service) and NIST (National Institute of Standards and Technology).
For USPS data, where the input space is 256-dimensional, the following polynomial and Gaussian kernels were used:
K(x, y) = ( (x ⋅ y) / 256 )^d
K(x, y) = exp( -||x - y||² / (256 σ²) )
for different values of d and σ.
For polynomial kernels, degrees from 1 to 6 have been tested; for Gaussian kernels, values of σ between 0.1 and 4.0. The USPS data are reported to be totally separable with a maximal margin machine starting from degree 3, whereas lower degrees with the 1-norm and 2-norm approaches generated errors.
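As a sketch, the two kernels above could be written in Matlab as follows; the function names are hypothetical helpers, and x, y are 256-dimensional column vectors:

% The USPS kernels as stated above (hypothetical helper functions).
function k = uspsPolyKernel(x, y, d)
    k = (dot(x, y) / 256)^d;                       % polynomial kernel, degree d
end

function k = uspsGaussKernel(x, y, sigma)
    k = exp(-norm(x - y)^2 / (256 * sigma^2));     % Gaussian kernel
end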
This whole set of experiments is particularly interesting, because the data
have been extensively studied and there are algorithms that have been
designed specifically for this dataset. The fact that SVM can perform as
well as these systems without including any detailed prior knowledge is
certainly remarkable.
A.2 Text Categorization
The task of text categorization is the classification of natural text (or hypertext) documents into a fixed number of predefined categories based on their content. This problem arises in a number of different areas including email filtering, web searching, office automation, sorting documents by topic and the classification of news agency stories. Since a document can be assigned to more than one category, this is not a multiclass classification problem, but it can be viewed as a series of binary classification problems, one for each category.
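A minimal Matlab sketch of this view, assuming a 0/1 label matrix Y with one column per category and a placeholder binary trainer train_svm (any of the SMO routines of this work could take its place):

% One independent binary classifier per category: a document may be
% assigned to every category whose classifier decides positively.
nCategories = size(Y, 2);
models = cell(nCategories, 1);
for c = 1:nCategories
    labels = 2 * Y(:, c) - 1;            % map {0,1} to {-1,+1}
    models{c} = train_svm(X, labels);    % placeholder for any binary trainer
end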
There are many resources available on the internet in this field, so we won't go into detail here. But one interesting work should be noted, which also led to an SVM library with its own algorithm: the text categorization of the Reuters news corpus by Joachims with his own SVMLight algorithm [Joa98].
B LINEAR CLASSIFIERS
B.1 The Perceptron
The first iterative algorithm for learning linear classifications is the procedure proposed by Frank Rosenblatt in 1956 for the Perceptron [Nel00]. The neural network literature offers another view on the Perceptron, which is often easier to understand (see figure B.1.1).
Figure B.1.1: The neural network view on the perceptron for binary classification. The input vector x = (x1, …, xn) is "weighted" by multiplying each element with the corresponding element of the weight vector w = (w1, …, wn). Then the products are summed up, which is equivalent to w ⋅ x = Σ_i w_i x_i. Last but not least, the sum is "classified" by a threshold function, e.g. here the signum function: class 1 if the sum ≥ 0, class -1 otherwise. The bias is disregarded for simplicity.
The algorithm used here is an 'on-line' and 'mistake-driven' one: it starts with an initial weight vector w0 (usually all zeros) and adapts it each time a training example is misclassified by the current weights.
A fact that needs to be stressed here is that the weight vector and the bias are updated directly in the algorithm, something that is referred to as the primal form, in contrast to an alternative dual representation which will be introduced below.
The whole procedure is guaranteed to converge if and only if the training points can be separated by a hyperplane. In this case the data is said to be linearly separable. If not, the weights (and the bias) are updated indefinitely, each time a point is misclassified, so the algorithm is not able to converge and only jumps from one unstable state to the next. In this case the data is non-separable.
For a detailed description of the algorithms see [Nel00].
Given a linearly separable training set S = ((x1, y1), …, (xn, yn)) with X ⊆ ℝ^n, Y ∈ {-1, 1}, the learning rate η ∈ ℝ⁺ and the initial parameters
w0 = 0, b0 = 0, k = 0
R = max_{1≤i≤n} ||x_i||
repeat
  for i = 1 to n
    if y_i (w_k ⋅ x_i + b_k) ≤ 0   // mistake
      w_{k+1} = w_k + η y_i x_i
      b_{k+1} = b_k + η y_i R²
      k = k + 1
    end if
  end for
until no mistakes in for loop
return k, (w_k, b_k); k is the number of mistakes
Figure B.1.2: The Perceptron Algorithm for training in primal form
The training of figure B.1.2 leads to the following decision function for some unseen data point z that needs to be classified:
h(z) = sgn(w_k ⋅ z + b_k) = sgn( Σ_i w_i z_i + b_k )
One can see in this algorithm that the perceptron 'simply' works by adding misclassified positive (y = 1) training examples to, or subtracting misclassified negative (y = -1) ones from, an initial weight vector w0.
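A minimal Matlab sketch of this primal training loop (a hypothetical helper, not one of the files on the CD; X holds one training point per row, Y the labels in {-1, +1}, eta the learning rate):

function [w, b, k] = PrimalPerceptron(X, Y, eta)
    [n, dim] = size(X);
    w = zeros(dim, 1); b = 0; k = 0;
    R = max(sqrt(sum(X.^2, 2)));             % R = max ||x_i||
    mistakes = true;
    while mistakes                           % loops forever on non-separable data
        mistakes = false;
        for i = 1:n
            if Y(i) * (X(i,:) * w + b) <= 0  % mistake
                w = w + eta * Y(i) * X(i,:)';
                b = b + eta * Y(i) * R^2;
                k = k + 1;
                mistakes = true;
            end
        end
    end
end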
So, if we assume the initial weight vector is the zero vector, the resulting weight vector is overall a linear combination of all training points:
w = Σ_{i=1}^{n} α_i y_i x_i     (B.1.1)
with all α_i ≥ 0, because the sign is already given by the corresponding y_i. The main property of the α_i is that their value is proportional to the number of times a misclassification of x_i has caused the weights to be updated.
Therefore, once the linearly separable training set S has been correctly classified by the Perceptron and the weight vector has converged to its stable state, one can think of the newly introduced vector α as an alternative representation of the primal form, the so-called dual form in dual coordinates:
f(x) = w ⋅ x + b = Σ_i w_i x_i + b = Σ_i α_i y_i (x_i ⋅ x) + b     (B.1.2)
And so the perceptron algorithm can be rewritten in the dual form as shown in figure B.1.3.
Given a linearly separable training set S = ((x1, y1), …, (xn, yn)) with X ⊆ ℝ^n, Y ∈ {-1, 1}, the learning rate η ∈ ℝ⁺ and the initial parameters
α = 0, b = 0
R = max_{1≤i≤n} ||x_i||
repeat
  for i = 1 to n
    if y_i ( Σ_{j=1}^{n} α_j y_j (x_j ⋅ x_i) + b ) ≤ 0   // mistake
      α_i = α_i + 1
      b = b + y_i R²
    end if
  end for
until no mistakes in for loop
return (α, b) for defining the decision function
Figure B.1.3: The Perceptron Algorithm for training in dual form
The learning rate is omitted here because it only changes the scaling of the hyperplane, but does not affect the algorithm when starting from a zero vector.
Overall, the decision function in dual representation for unseen data z is given by:
h(z) = sgn(w ⋅ z + b) = sgn( Σ_i α_i y_i (x_i ⋅ z) + b )     (B.1.3)
This alternative representation of the primal Perceptron Algorithm and the corresponding decision function has many interesting and important properties. Firstly, the points in the training set which were harder to learn have larger α_i. But the most important thing to stress here is the fact that the training points x_i (and likewise the unseen points) enter the algorithm only in the form of the inner product x_i ⋅ x, which will have an enormous impact on the algorithms used by the Support Vector Machines, where it is referred to as a kernel.
B.2 A calculated example with the Perceptron Algorithm
The source code for this example in dual form, written in Matlab, can be found here (see also B.1):
Matlab Files\Perceptron\DualPerceptron.m
The predefined workspace variables are here:
Matlab Files\Perceptron\DualPerceptronVariables_OR_AND.mat
For a better understanding of linear separability, we have a look at the most commonly used binary functions: AND, OR and XOR.
The calling convention is: [weights bias alphas] = DualPerceptron(X, Y).
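As an illustration, a minimal sketch of such a dual training routine is given below; the actual DualPerceptron.m on the CD may differ in detail:

function [weights, bias, alphas] = DualPerceptron(X, Y)
    n = size(X, 1);
    alphas = zeros(n, 1); bias = 0;
    R = max(sqrt(sum(X.^2, 2)));             % R = max ||x_i||
    G = X * X';                              % all inner products x_j . x_i
    mistakes = true;
    while mistakes                           % assumes linearly separable data
        mistakes = false;
        for i = 1:n
            if Y(i) * (G(i,:) * (alphas .* Y) + bias) <= 0   % mistake
                alphas(i) = alphas(i) + 1;
                bias = bias + Y(i) * R^2;
                mistakes = true;
            end
        end
    end
    weights = X' * (alphas .* Y);            % w from equation B.1.1
end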
OR                    AND                   XOR
x1  x2   y            x1  x2   y            x1  x2   y
 0   0  -1             0   0  -1             0   0  -1
 0   1   1             0   1  -1             0   1   1
 1   0   1             1   0  -1             1   0   1
 1   1   1             1   1   1             1   1  -1
Figure B.2.1: Examples for linearly separable and non-separable data
The OR and AND datasets are both linearly separable, while the XOR data cannot be separated by means of one line. In these three cases the hyperplane is a line, because the input space is 2-dimensional (see Chapter 5).
Definition B.2.1 (Separability):
A training set S = {(x1, y1), …, (xl, yl)}: x_i ∈ ℝ^n, y_i ∈ {-1, 1} is called separable by the hyperplane w ⋅ x + b = 0 if there exist both a vector w and a constant b such that the following conditions always hold:
w ⋅ x_i + b > 0 for y_i = 1
w ⋅ x_i + b < 0 for y_i = -1
The hyperplane defined by w and b is called a separating hyperplane.
In detail we only calculate the OR case:
After the dual perceptron algorithm has converged to its stable state, the vector α consists of (7 3 3 0)' and the bias has a value of -2.
So now we are able to compute the weight vector (see equation B.1.1):
w = 7 ⋅ (-1) ⋅ (0 0)' + 3 ⋅ 1 ⋅ (0 1)' + 3 ⋅ 1 ⋅ (1 0)' + 0 ⋅ 1 ⋅ (1 1)' = (3 3)'
The function of the hyperplane separating the OR dataset, here a line, is then defined as follows:
f(x) = w ⋅ x + b = Σ_i w_i x_i + b = (3 3)' ⋅ x - 2 = 3x1 + 3x2 - 2
If you test the decision function of B.1.3 with the values x of the OR table in figure B.2.1, the classification of each point is correct.
E.g. testing x1 = (0 0)' and x3 = (1 0)':
sgn( (3 3)' ⋅ (0 0)' - 2 ) = sgn(3·0 + 3·0 - 2) = sgn(-2) = -1
sgn( (3 3)' ⋅ (1 0)' - 2 ) = sgn(3·1 + 3·0 - 2) = sgn(1) = 1
C CALCULATION EXAMPLES
C.1 Chapter 4
• Lagrangian method on a constrained function in two variables, and a graphical way to find a solution:
We search for the local extrema of the function
f(x, y) = x² + 2y²
constrained by
g(x, y) = x + y = 3.
As a first intuition we choose a graphical way to do this: first draw the constraint into the x-y-plane, then insert the isoquants (level lines) of the function f, and finally search for level lines which are cut by the constraint, to get an approximation of where the optimum is. Isoquants or level lines are defined as seen in figure C.1.1.
Figure C.1.1: The function f(x, y) = e^(-x²) ⋅ e^(-y²) and the corresponding level lines
The above technique is shown in figure C.1.2.
Figure C.1.2: A graphical solution to a function in 2 variables with one equality constraint
And now the solution with the Lagrangian method. As seen in chapter 4, the Lagrangian for an objective function f(x, y) in two variables with one constraint g(x, y) = c is defined as:
L(x, y, λ) = f(x, y) + λ (c - g(x, y))
The necessary conditions for an optimal solution can then be stated as (find the stationary point(s)):
∂L/∂x = L_x = f_x - λ g_x = 0
∂L/∂y = L_y = f_y - λ g_y = 0
∂L/∂λ = L_λ = c - g(x, y) = 0
Therefore the example can be reformulated in this way:
L(x, y, λ) = (x² + 2y²) + λ (3 - x - y)
And to find the stationary point(s):
L_x = 2x - λ = 0
L_y = 4y - λ = 0
L_λ = 3 - x - y = 0
This (linear) system of equations has the following solution:
x = 2, y = 1 and λ = 4.
So the only stationary point of f(x, y) constrained by g(x, y) is
x0 = (2; 1).
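The three stationarity conditions form a linear system that can be solved directly, e.g. in Matlab:

% Rows: L_x = 2x - lambda = 0, L_y = 4y - lambda = 0, x + y = 3
A = [2 0 -1;
     0 4 -1;
     1 1  0];
rhs = [0; 0; 3];
sol = A \ rhs                        % yields x = 2, y = 1, lambda = 4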
• Lagrangian method on a constrained function in three variables with two constraints.
We search for the stationary points of the function
f(x, y, z) = (x - 1)² + (y - 2)² + 2z²
constrained by
x + 2y = 2 and y + z = 3.
Recall the generalized Lagrangian function for equality constraints from chapter 4:
L(x1 … xn; λ1 … λk) = f(x1 … xn) + Σ_{i=1}^{k} λ_i (c_i - g_i(x1 … xn))
for a function f of n variables and k equality constraints g_i of the form g_i(x1 … xn) = c_i.
So the Lagrangian function for the example is:
L(x, y, z, λ, μ) = ((x - 1)² + (y - 2)² + 2z²) + λ (2 - x - 2y) + μ (3 - y - z)
And the conditions for stationary points of L can be stated as:
L_x = 2(x - 1) - λ = 0
L_y = 2(y - 2) - 2λ - μ = 0
L_z = 4z - μ = 0
L_λ = 2 - x - 2y = 0
L_μ = 3 - y - z = 0
Again we get a linear system of 5 equations in 5 unknowns, which is easily solved and has the unique solution:
x = -6/7, y = 10/7, z = 11/7, λ = -26/7, μ = 44/7
And so the only stationary point of f(x, y, z) under the above constraints is
x0 = (-6/7; 10/7; 11/7).
C.2 Chapter 5
• Equation 5.1:
f(x) = w ⋅ x + b = Σ_i w_i x_i + b
With w = (1 3)', b = -3 and x = (2 5)':
f(x) = (1 3)' ⋅ (2 5)' - 3 = 1·2 + 3·5 - 3 = 2 + 15 - 3 = 14
• Definition 5.1 (Margin):
Normalisation of w and b by
w_norm = w / ||w|| and b_norm = b / ||w||.
With w = (2 5)', b = -3:
||w|| = √(2² + 5²) = √29
w_norm = (2/√29  5/√29)',  b_norm = -3/√29
so that ||w_norm|| = 1:
||w_norm|| = √(4/29 + 25/29) = √(29/29) = 1
In words, normalising means scaling a vector to a length of 1. E.g. (1 1)' can be seen as the diagonal of the unit square and therefore has a length of √2 = ||(1 1)'||; scaling by 1/"length", here 1/√2, performs this step.
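A quick numerical check of this normalisation in Matlab:

w = [2; 5]; b = -3;
w_norm = w / norm(w);                % (2 5)' / sqrt(29)
b_norm = b / norm(w);                % -3 / sqrt(29)
norm(w_norm)                         % equals 1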
D SMO PSEUDO CODES
D.1 Pseudo Code of original SMO
target = desired output vector
point = training point matrix
procedure takeStep(i1,i2)
if (i1 == i2) return 0
alph1 = Lagrange multiplier for i1
y1 = target[i1]
E1 = SVM output on point[i1] - y1 (check in error cache)
m = y1*y2
Compute L, H
if (L == H)
return 0
k11 = kernel(point[i1],point[i1])
k12 = kernel(point[i1],point[i2])
k22 = kernel(point[i2],point[i2])
eta = 2*k12-k11-k22
if (eta < 0)
{
a2 = alph2 - y2*(E1-E2)/eta
if (a2 < L) a2 = L
else if (a2 > H) a2 = H
}
else
{
Lobj = objective function at a2=L
Hobj = objective function at a2=H
if (Lobj > Hobj+eps)
a2 = L
else if (Lobj < Hobj-eps)
a2 = H
else
a2 = alph2
}
if (a2 < 1e-8)
a2 = 0
else if (a2 > C-1e-8)
a2 = C
if (|a2-alph2| < eps*(a2+alph2+eps))
return 0
a1 = alph1+m*(alph2-a2)
Update threshold to reflect change in Lagrange multipliers
Update weight vector to reflect change in a1 & a2, if linear SVM
Update error cache using new Lagrange multipliers
Store a1 in the alpha array
Store a2 in the alpha array
return 1
endprocedure
procedure examineExample(i2)
y2 = target[i2]
alph2 = Lagrange multiplier for i2
E2 = SVM output on point[i2] - y2 (check in error cache)
r2 = E2*y2
if ((r2 < -tol && alph2 < C) || (r2 > tol && alph2 > 0))
{
if (number of non-zero & non-C alpha > 1)
{
i1 = result of second choice heuristic
if takeStep(i1,i2)
return 1
}
loop over all non-zero and non-C alpha, starting at random point
{
i1 = identity of current alpha
if takeStep(i1,i2)
return 1
}
loop over all possible i1, starting at a random point
{
i1 = loop variable
if takeStep(i1,i2)
return 1
}
}
return 0
endprocedure
main routine:
initialize alpha array to all zero
initialize threshold to zero
numChanged = 0;
examineAll = 1;
while (numChanged > 0 | examineAll)
{
numChanged = 0;
if (examineAll)
loop I over all training examples
numChanged += examineExample(I)
else
loop I over examples where alpha is not 0 & not C
numChanged += examineExample(I)
if (examineAll == 1)
examineAll = 0
else if (numChanged == 0)
examineAll = 1
}
D.2 Pseudo Code of Keerthi's improved SMO
target = desired output vector
point = training point matrix
fcache = cache vector for Fi values
% Note: The definition of Fi is different from the Ei in Platt’s SMO algorithm.
% The Fi does not subtract any threshold.
procedure takeStep(i1, i2)
% Much of this procedure is same as in Platt’s original SMO pseudo code
if (i1 == i2) return 0
alph1 = Lagrange multiplier for i1
y1 = target[i1]
F1 = fcache[i1]
m = y1*y2
Compute L, H
if (L == H) return 0
k11 = kernel(point[i1], point[i1])
k12 = kernel(point[i1], point[i2])
k22 = kernel(point[i2], point[i2])
eta = 2*k12-k11-k22
if (eta < 0)
{
a2 = alph2 - y2*(F1-F2)/eta
if (a2 < L) a2 = L
else if (a2 > H) a2 = H
}
else
{
Lobj = objective function at a2=L
Hobj = objective function at a2=H
if (Lobj > Hobj+eps)
a2 = L
else if (Lobj < Hobj - eps)
a2 = H
else
a2 = alph2
}
if ( |a2-alph2| < eps*(a2+alph2+eps) )
return 0
a1 = alph1+m*(alph2-a2)
Update weight vector to reflect change in a1 & a2, if linear SVM
Update fcache[i] for i in I_0 using new Lagrange multipliers
Store a1 and a2 in the alpha array
% The update below is simply achieved by keeping and updating information
% about alpha_i being 0, C or in between. Using this together with
% target[i] gives the information as to which index set i belongs
Update I_0, I_1, I_2, I_3 and I_4
% Compute updated F values for i1 and i2 …
fcache[i1] = F1 + y1*(a1-alph1)*k11 + y2*(a2-alph2)*k12
fcache[i2] = F2 + y1*(a1-alph1)*k12 + y2*(a2-alph2)*k22
Compute (i_low, b_low) and (i_up, b_up) by applying equations (A) and (B)
using only i1, i2 and indices in I_0
return 1
endprocedure
procedure examineExample(i2)
y2 = target[i2]
alph2 = Lagrange multiplier for i2
if (i2 is in I_0)
{
F2 = fcache[i2]
}
else
{
compute F2 = F_i2 and set fcache[i2] = F2
% Update (b_low, i_low) or (b_up, i_up) using (F2, i2) …
if ((i2 is in I_1 or I_2) && (F2 < b_up))
b_up = F2, i_up = i2
else if ((i2 is in I_3 or I_4) && (F2 > b_low))
b_low = F2, i_low = i2
}
% Check optimality using current b_low and b_up and, if violated, find an
% index i1 to do joint optimization with i2 ….
optimality = 1
if (i2 is in I_0, I_1 or I_2)
{
if (b_low - F2 > 2*tol)
optimality = 0, i1 = i_low
}
if (i2 is in I_0, I_3 or I_4)
{
if (F2 - b_up > 2*tol)
optimality = 0, i1 = i_up
}
if (optimality == 1)
return 0
% For i2 in I_0 choose the better i1 …
if (i2 is in I_0)
{
if (b_low - F2 > F2 - b_up)
i1 = i_low
else
i1 = i_up
}
if takeStep(i1, i2)
return 1
else
return 0
endprocedure
main routine for Modification 1 (same as SMO):
initialize alpha array to all zero
initialize b_up = -1, i_up to any index of class 1
initialize b_low = 1, i_low to any index of class 2
set fcache[i_low] = 1 and fcache[i_up] = -1
numChanged = 0;
examineAll = 1;
while (numChanged > 0 | examineAll)
{
numChanged = 0;
if (examineAll)
{
loop I over all training examples
numChanged += examineExample(I)
}
else
{
loop I over I_0
numChanged += examineExample(I)
% It is easy to check if optimality on I_0 is attained …
if (b_up > b_low - 2*tol) at any I
exit the loop after setting numChanged = 0
}
if (examineAll == 1)
examineAll = 0
else if (numChanged == 0)
examineAll = 1
}
main routine for Modification 2:
initialize alpha array to all zero
initialize b_up = -1, i_up to any index of class 1
initialize b_low = 1, i_low to any index of class 2
set fcache[i_low] = 1 and fcache[i_up] = -1
numChanged = 0;
examineAll = 1;
while (numChanged > 0 | examineAll)
{
numChanged = 0;
if (examineAll)
{
loop I over all training examples
numChanged += examineExample(I)
}
else
% The following loop is the only difference between the two SMO
% modifications. Whereas, in modification 1, the inner loop selects i2
% from I_0 sequentially, here i2 is always set to the current i_low and
% i1 is set to the current i_up; clearly, this corresponds to choosing
% the worst violating pair using members of I_0 and some other indices.
{
inner_loop_success = 1;
do until ((b_up > b_low-2*tol) | inner_loop_success == 0)
{
i2 = i_low
y2 = target(i2)
alph2 = Lagrange multiplier for i2
F2 = fcache[i2]
inner_loop_success = takeStep(i_up, i_low)
numChanged += inner_loop_success
}
numChanged = 0
}
if (examineAll == 1)
examineAll = 0
else if (numChanged == 0)
examineAll = 1
}