Offline and Online SVM Performance Analysis

by

Kathy F. Chen

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 2007

© Massachusetts Institute of Technology 2007. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, December 8, 2006

Certified by: Una-May O'Reilly, Principal Researcher, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses
Offline and Online SVM Performance Analysis
by
Kathy F. Chen
Submitted to the Department of Electrical Engineering and Computer Science
on December 8, 2006, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract

To understand and evaluate the performance of a machine learning algorithm, the Support Vector Machine, this thesis compares the strengths and weaknesses of the offline and online SVM. The work includes performance comparisons of SVMLight and LaSVM, with results for training time, number of support vectors, kernel evaluations, and test accuracies. Multiple datasets are used in the experiments to cover a wide range of input data and training problems. Overall, the online LaSVM trained in less time and returned test accuracies comparable to SVMLight's. A general breakdown of the two algorithms and their computational effort is included for detailed analysis.

Thesis Supervisor: Una-May O'Reilly
Title: Principal Researcher
Acknowledgments
I would like to thank my thesis advisor, Una-May O'Reilly, for guiding me through the research
process and for providing constant support throughout the unfolding of the mysterious SVM algorithms. I felt truly lucky to have an advisor like her and have learned tremendously from our
weekly meetings and discussions.
I would also like to thank my friends and family for always being there for me when I felt discouraged. They have given me the strength to continue and the will to finish this important task.
This project was made possible by funding support from DARPA IPTO and the Architecture for Cognitive Information Processing (ACIP) program. Special thanks also go to Janice McMahon
of USC/ISI and Chris Archer of Northrop Grumman.
Contents

1 Introduction

2 Background
2.1 SVM Overview
2.1.1 Kernel Functions
2.2 SVMLight
2.3 Online SVM
2.3.1 Sequential Minimal Optimization
2.4 LaSVM
2.4.1 Overview
2.4.2 Selection Strategy
2.4.3 Process and Reprocess
2.4.4 Termination Condition

3 SVM Experiments
3.1 Experimental Setup
3.2 MNIST Dataset
3.3 Income Prediction
3.4 Webpage Classification
3.5 Face Detection
3.6 CEARCH Entity Classification

4 Performance and Tradeoff
4.1 Cross-sectional Dataset Comparison
4.2 Procedure Count Analysis
4.3 Sensitivity to Tolerance Parameter ε

5 Summary
5.1 Future Work
List of Figures

2-1 SVM example
2-2 SVM with outliers
2-3 SVMLight High Level Code Flow
2-4 LaSVM Pseudo code
3-1 Simulation map
3-2 MNIST training time
3-3 MNIST number of support vectors
3-4 MNIST test error
3-5 MNIST training time vs number of support vectors
3-6 MNIST training time vs cache size
3-7 Income training time comparison
3-8 Training time vs sample size
3-9 Income number of support vectors
3-10 Number of support vectors vs sample size
3-11 Income test accuracies
3-12 Test accuracy vs sample size
3-13 Income kernel evaluations
3-14 Kernel evaluations vs sample size
3-15 Webpage training time comparison
3-16 Training time vs sample size
3-17 Webpage number of support vectors
3-18 Number of support vectors vs sample size
3-19 Webpage test accuracies
3-20 Test accuracy vs sample size
3-21 Webpage kernel evaluations
3-22 Kernel evaluations vs sample size
3-23 Face detection training time comparison
3-24 Training time vs sample size
3-25 Face detection number of support vectors
3-26 Number of support vectors vs sample size
3-27 Face detection: test accuracies
3-28 Test accuracy vs sample size
3-29 Face detection: kernel evaluations
3-30 Kernel evaluations vs sample size
3-31 Test accuracy vs gamma
3-32 Test accuracy vs c
3-33 Test accuracy vs weight
3-34 CEARCH training time comparison
3-35 Training time vs sample size
3-36 CEARCH number of support vectors
3-37 Number of support vectors vs sample size
3-38 CEARCH test accuracies
3-39 Test accuracy vs sample size
4-1 Top 20 processes of SVMLight
4-2 Top 20 processes of LaSVM
4-3 Income: Accuracy vs sample size
4-4 Income: Accuracy vs sample size
4-5 Income: Accuracy vs sample size
List of Tables

3.1 Dataset summary
3.2 MNIST dataset details
3.3 MNIST summary
3.4 Income dataset details
3.5 Income summary
3.6 Webpage dataset details
3.7 Webpage summary
3.8 Face detection dataset details
3.9 Face detection summary
3.10 CEARCH dataset details
3.11 CEARCH entity summary
4.1 SVMLight performance summary
4.2 LaSVM performance summary
4.3 Top performing algorithms for each dataset
Chapter 1
Introduction
Support Vector Machines [9] are being applied extensively to classification problems in many applications, ranging from handwriting recognition [1] and text categorization [5] to face detection [6], and have produced promising results.
Developed by Vapnik, SVM aims to maximize the classification margin of a hyperplane boundary determined by a subset of training points called support vectors.
In his paper [4], Joachims described the approach of SVMLight, a software implementation of the SVM algorithm in C. SVMLight is among the most widely used SVM classification tools and has been shown to solve large-scale datasets efficiently by decomposing the Quadratic Programming (QP) problem into smaller subproblems.
Similar to SVMLight's approach, Platt proposed the Sequential Minimal Optimization (SMO) method in [11], which breaks the problem into the smallest possible subset of 2 data points. The problem can then be solved analytically without a QP solver. An online approach that incorporates the essence of both SMO and the incremental approach [7] is the LaSVM method, presented in [10]. The LaSVM method updates the SVM model by inserting and removing appropriate support vectors over the entire training process.
This thesis will explore the performance of both offline and online SVM, using the algorithms
provided in SVMLight and LaSVM. Chapter 2 will describe their background details; Chapter
3 contains the results of multiple experiments on a wide variety of datasets. Chapter 4 covers a
cross-sectional dataset study of both algorithms and their procedure details.
Chapter 2
Background
The first section will give an overview of the SVM algorithm and discuss the use of kernel functions. The second section will describe the implementation details of SVMLight by Thorsten Joachims, and explain the list of parameters which can be optimized based on the scope of a problem. The third section will give an overview of online SVM and discuss some of the key differences from the offline approach.
2.1 SVM Overview
Let the training data be

(x_1, y_1), \ldots, (x_l, y_l), \qquad x_i \in \mathbb{R}^n, \; y_i \in \{+1, -1\}.    (2.1)
The Support Vector Machine learns to classify the data points by finding a set of Lagrange multipliers \alpha_1, \alpha_2, \ldots, \alpha_l for the n-dimensional training points in a set of size l, such that there is a maximum margin between the two classes marked by y = +1 and y = -1. The training points with non-zero \alpha values are called the support vectors; they determine the optimal hyperplane [2][4][9].
A simple example in two dimensions can be seen in Figure 2-1. Two classes with features x_1 and x_2 are classified by a hyperplane (solid line) with the largest margin, determined by the support vectors on the dashed lines.

Figure 2-1: SVM example
The decision function based on the trained model is thus

f(x) = \mathrm{sign}(w \cdot x + b),    (2.2)

where w \cdot x is a linear combination of all the support vectors. Training data that are not support vectors have zero \alpha values and no effect on the decision function.
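As a concrete illustration of Eq. (2.2), the following minimal sketch evaluates the decision function of a linear SVM; the support vectors, multipliers, and bias are hypothetical numbers chosen for illustration, not values from any experiment in this thesis.

import numpy as np

# Hypothetical trained quantities: support vectors (rows of sv_x), their
# labels in {-1, +1}, their non-zero multipliers, and the bias b.
sv_x = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
sv_y = np.array([+1.0, -1.0, +1.0])
sv_alpha = np.array([0.5, 0.7, 0.2])
b = -0.1

# w is a linear combination of the support vectors only; training points
# with alpha = 0 contribute nothing to the decision function.
w = (sv_alpha * sv_y) @ sv_x

def f(x):
    # Eq. (2.2): f(x) = sign(w . x + b)
    return np.sign(w @ x + b)

print(f(np.array([2.0, 2.0])))  # classify a new point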
In order to find the hyperplane that satisfies the maximum-margin requirement, the problem is transformed into a Quadratic Programming problem. First, one class (black points) is denoted +1 and the other class (white points) is denoted -1. Given a vector w which is perpendicular to the hyperplane, the class boundaries can be represented, after normalization, by the following equations.

Class +1 boundary (dashed line):

w \cdot x^{+} = +1    (2.3)

Class -1 boundary (dashed line):

w \cdot x^{-} = -1    (2.4)
The margin M is the distance from the +1 boundary to the -1 boundary. Subtracting Eq. (2.4) from Eq. (2.3) and projecting x^{+} - x^{-} onto the unit normal w / \|w\| gives

M = \frac{w}{\|w\|} \cdot (x^{+} - x^{-}) = \frac{2}{\|w\|}.    (2.5, 2.6)
Given training data with multidimensional features, the SVM algorithm maximizes the margin between the two classes by solving for the global minimum of a convex Quadratic Programming problem. Quadratic Programming is a mathematical optimization problem which minimizes a non-linear quadratic cost function subject to linear equality or inequality constraints. The resulting QP:

\min_{w,b} \; \frac{1}{2}\|w\|^2    (2.7)

\text{s.t.} \; y_i \cdot ((w \cdot x_i) + b) \geq 1, \qquad i = 1, \ldots, l.    (2.8)
For training data that are not linearly separable, we can relax the constraints by allowing data points to lie on the opposite side of the hyperplane with a predetermined penalty C. The distance of each wrongly classified point from the correct boundary is represented by a slack variable \xi, and the objective function of the QP transforms into the following equation:

\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_{k=1}^{R} \xi_k,    (2.9)

where R is the total number of misclassified data points and \xi_k is the kth point's distance from the decision boundary.
In Figure 2-2, three outliers lie on the wrong side of the hyperplane. The value of C determines the trade-off between a generalized model and perfect classification. For problems with outliers or noise, we often want to utilize this "soft margin" approach to avoid over-fitting.

Figure 2-2: SVM with outliers
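To make the role of C concrete, the following sketch (assuming the scikit-learn library is available; the data are synthetic) fits the same noisy two-class data with a small and a large C and reports the resulting margin M = 2/\|w\| and support vector count. A small C tolerates the outliers and keeps a wide margin, while a large C chases them at the cost of generalization.

import numpy as np
from sklearn.svm import SVC

# Two Gaussian blobs with a few mislabeled points acting as outliers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
y[:3] = 1  # outliers on the wrong side of the boundary

for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)  # M = 2 / ||w||, Eqs. (2.5)-(2.6)
    print(f"C={C}: margin={margin:.3f}, support vectors={clf.support_.size}")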
2.1.1 Kernel Functions
To classify non-linearly separable data, the SVM first applies a kernel function which maps the feature data into another finite- or infinite-dimensional space. Data that are not separable in the lower dimension may now be separable by a hyperplane in the new, higher dimension.
Below is a list of commonly used kernel functions. In practice, the choice of kernel and kernel parameters is often optimized using cross-validation.

* Linear kernel:
K(x, x') = x^T x'    (2.10)

* Polynomial kernel:
K(x, x') = (1 + x^T x')^p    (2.11)

* Radial basis (Gaussian) kernel:
K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)    (2.12)

* Sigmoid kernel:
K(x, x') = \tanh(\kappa (x^T x') + \vartheta)    (2.13)
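As a small sketch of Eq. (2.12) (the experiments in Chapter 3 use the equivalent parameterization gamma = 1 / (2 sigma^2)), the following computes a Gaussian kernel matrix with NumPy and checks that it is a valid kernel, i.e., symmetric and positive semi-definite:

import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2)), Eq. (2.12)
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_kernel(X, X)
assert np.allclose(K, K.T)                  # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-9  # positive semi-definite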
2.2 SVMLight
Training a large-scale SVM classifier often strains memory and training-time resources. In his paper [5], Joachims described SVMLight, which incorporates several techniques to make training large-scale classifiers more efficient and practical.
The SVMLight algorithm includes the following techniques:

* a decomposition strategy (Osuna's active set)
* selection of the appropriate variables for the working set (Zoutendijk's method)
* shrinking of the QP problem
* software caching of the Hessian (2nd derivative) computation
The algorithm first decomposes the original QP problem into a smaller subproblem by selecting an appropriate working set based on Zoutendijk's method. The method tries to find a direction d which is the steepest-descent direction with q non-zero elements. As discussed in [5], the working-set selection is itself another QP problem, which can be solved by restricting q to be an even number and selecting the q/2 largest elements from the positive and negative weights respectively; a simplified sketch is shown below.
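The sketch keeps only the core idea of that selection, sorting the variables by how much they can improve the objective and taking q/2 from each end; the feasibility conditions that the full method in [5] imposes on each variable are deliberately omitted, and the names are illustrative.

import numpy as np

def select_working_set(grad, y, q):
    # grad: gradient of the dual objective; y: labels in {-1, +1};
    # q: even working-set size.
    omega = y * grad              # per-variable improvement measure
    order = np.argsort(-omega)    # sort in descending order
    # q/2 elements from the positive end and q/2 from the negative end.
    return np.concatenate([order[: q // 2], order[-(q // 2):]])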
After finding the appropriate working set, only the weights for the working set are updated
while the rest remain fixed in the current iteration. This smaller QP subproblem can then be solved
with a nonlinear interior-point solver [8].
Besides the decomposition and working set selection mentioned above, SVMLight also incorporates the idea of shrinking. The main purpose of shrinking is to identify the bounded support
vectors using heuristics and remove them from the training set in order to reduce the size of the
QP problem.
Since the most computation-intensive operation in training the SVM is the Hessian computation, the use of a software cache saves repetitive matrix computation and reduces the training
time.
There are many parameters associated with the SVM, and in particular with the SVMLight implementation.

* C: a measure of the trade-off between training error and margin. For a large C, the trained model may have a near-perfect separation of the two classes; the disadvantages may be a small margin between the class boundaries and poor generalization performance. In the QP problem, C is also the upper bound on the Lagrange multipliers \alpha_1, \alpha_2, \ldots, \alpha_l of the support vectors, via the inequality constraints

0 \leq \alpha_i \leq C.    (2.14)
* q: maximum size of the QP subproblem. The parameter q determines the size of the working set, which contains the variables that are updated in the current iteration.

* m: size of the kernel-evaluation cache in MB. The parameter m specifies the size of the cache used to store precomputed kernel values.
* ε: allowable tolerance for the QP termination condition. In SVMLight, this value is the amount of error allowed while training the model to satisfy the inequality constraint in Eq. (2.8):

y_i \cdot ((w \cdot x_i) + b) - 1 \geq -\epsilon.    (2.15)
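For reference, these parameters map directly onto SVMLight's standard svm_learn command line (-t selects the kernel type; -g, -c, -m, -q, and -e correspond to gamma, C, m, q, and ε). The invocation below mirrors the MNIST setup of Chapter 3; the file names are hypothetical.

# RBF kernel (-t 2), gamma 0.005, tradeoff C = 10, 256MB kernel cache,
# working set of size 10, termination tolerance epsilon = 0.001:
svm_learn -t 2 -g 0.005 -c 10 -m 256 -q 10 -e 0.001 train.dat mnist.model
svm_classify test.dat mnist.model predictions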
svm_learn_classification() {              // train the SVM
    optimize_to_convergence();
}

optimize_to_convergence() {
    while (not convergent) {
        clear working set;
        // select working set (a random set once in 100 times)
        if (counter++ % 101)
            select_next_qp_subproblem_grad();
        else
            select_next_qp_subproblem_rand();
        cache_multiple_kernel_rows();
        optimize_svm();
        update_linear_component();
        calculate_svm_model();
        check_optimality();
        if (seemingly convergent) {
            // make all variables active again which had been removed by
            // shrinking; computes them for those variables from scratch
            reactivate_inactive_examples();
        }
        if (counter++ % 10) {
            shrink_problem();
            if (num of SVs > max elements in cache)
                kernel_cache_shrink();
        }
        free variables for the next iteration;
    }
}

optimize_svm() {                           // do optimization on the working set
    compute_matrices_for_optimization();   // makes calls to kernel
    optimize_qp();                         // hildreth_despo method
    reassign variables accordingly;
}

update_linear_component() {
    if (linear case) {
        clear_vector_n();
        for (each element in the working set) {
            add_vector_ns();
            sprod_ns();
        }
    } else {
        ...                                // general case
    }
}

reactivate_inactive_examples() {
    if (linear case) {
        clear_vector_n();
        for (each element in the working set) {
            add_vector_ns();
            sprod_ns();
        }
    } else {
        ...                                // general case
    }
}

Figure 2-3: SVMLight High Level Code Flow
2.3 Online SVM
In this thesis, an online learning algorithm is defined to be an algorithm that allows multiple incremental updates of a model as training examples are processed. There may be other definitions of
online algorithms, such as ones that label incoming training data while learning or others that use
the accuracies of trained models to update their decision boundaries. The following sections focus
on the incremental-model definition and will explain the basic setup of the online SVM described in the LaSVM paper [10], including the criteria for selecting and removing support vectors and the termination condition.
2.3.1 Sequential Minimal Optimization
In order to solve for the global optimum of the SVM QP problem, LaSVM utilizes the Sequential Minimal Optimization (SMO) method proposed in [11].

Rather than solving one large QP problem in the training step, SMO sequentially solves subproblems of only two data points at a time, without degrading the dual objective function.
The SMO algorithm consists of three main components. The first component is an analytic
method to solve for the subproblem of two Lagrange multipliers. The second component is a
heuristic for picking which two multipliers to optimize. The third component is a method to
compute the bias b, which in SVMLight is calculated from the linear components as in Figure 2-3.
The analytic method that solves for the two Lagrange multipliers can be described as a hill-climbing algorithm in which the multiplier values are updated iteratively to satisfy the Karush-Kuhn-Tucker (KKT) conditions.

In order to improve the value of the objective function, SMO first selects one Lagrange multiplier that violates the KKT conditions. The other multiplier is then picked based on the maximal improvement to the value of the objective function.

The bias b is recomputed after every iteration of the analytic method; a minimal sketch of this update follows.
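The sketch below follows Platt's analytic two-multiplier update rule [11]; bias maintenance and the KKT-based pair-selection heuristics are omitted, and the names are illustrative rather than taken from either implementation.

import numpy as np

def smo_pair_step(i, j, alpha, y, K, C):
    # alpha: multipliers; y: labels in {-1, +1}; K: kernel matrix; C: box bound.
    f = (alpha * y) @ K                      # decision values (bias omitted)
    E_i, E_j = f[i] - y[i], f[j] - y[j]      # prediction errors

    # Feasible segment [L, H]: keeps sum(alpha * y) fixed and 0 <= alpha <= C.
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])

    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]  # curvature along the pair direction
    if L >= H or eta <= 0.0:
        return alpha                         # no progress possible on this pair

    new = alpha.copy()
    new[j] = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H)
    new[i] = alpha[i] + y[i] * y[j] * (alpha[j] - new[j])  # restore equality constraint
    return new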
2.4 LaSVM

2.4.1 Overview
LaSVM combines the techniques of Sequential Minimal Optimization and support vector removal. The resulting algorithm has been shown to use less memory and train faster than traditional SVM solvers. Especially when handling noisy data, where an increasing number of support vectors implies more computational cost, LaSVM aims to remove unnecessary support vectors to avoid overfitting and to lower kernel-computation requirements.
LaSVM implements SMO using a practical method suggested in [15]. The method defines a small positive tolerance \tau, and the search direction of SMO is determined by a \tau-violating pair. Points i and j form a \tau-violating pair if:

\alpha_i < B_i, \qquad \alpha_j > A_j, \qquad g_i - g_j > \tau.
In the initialization, LaSVM picks the first five samples from each class and assigns them as support vectors through the Process procedure. Next, for every epoch, LaSVM selects data points until all points have been selected or the termination condition is met. The order in which the data points are selected depends on a predetermined selection strategy: random, gradient, or active selection. When a particular data point is chosen, LaSVM calls the Process procedure, which inserts the data point as a support vector under constraints. For each insertion there is a removal procedure, Reprocess, in which the algorithm removes any non-support-vector data points from the model.
2.4.2 Selection Strategy
The online LaSVM has three selection strategies for picking the next data point for Process: random, gradient and active selection.
Random Selection
The random selection picks a random sample from the data points that have not yet been processed.
Gradient Selection
The gradient selection takes a user-specified candidate count as input. The goal is to pick the best gradient among the candidates: the program sequentially draws random samples from the unprocessed data points and selects the data point with the maximal gradient. The gradient here means the gradient of the dual objective function; having the maximal gradient means the largest improvement in the objective function. The gradient values are predicted by computing the kernel expansion on the selected data points.
The goal of the gradient selection is, therefore, to pick the most misclassified sample in the
hope of reducing the error of the objective function as much as possible.
Active Selection
Comparing to the aggressive correction of the gradient selection, the active selection takes a more
conservative approach by picking points that are misclassified but lie close to the decision boundary. This may be especially useful for noisy datasets where the most misclassified points are
usually exceptions and should be ignored.
The active selection picks the next candidate point by first computing the margin of all the
candidates points from the current boundary, and then selecting the one closest to the margin for
the process step.
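The three strategies can be summarized in a few lines; the sketch below assumes the current decision values f(x_k) are available for the candidates, and the candidate count is the user-chosen constant described above (all names are illustrative).

import numpy as np

rng = np.random.default_rng(0)

def select_next(unseen, f, y, strategy, n_candidates=50):
    # unseen: indices not yet processed; f: decision values; y: labels in {-1, +1}.
    if strategy == "random":
        return rng.choice(unseen)
    cands = rng.choice(unseen, size=min(n_candidates, len(unseen)), replace=False)
    if strategy == "gradient":
        # Most misclassified candidate: smallest margin y_k * f(x_k).
        return cands[np.argmin(y[cands] * f[cands])]
    if strategy == "active":
        # Candidate closest to the decision boundary: smallest |f(x_k)|.
        return cands[np.argmin(np.abs(f[cands]))]
    raise ValueError(f"unknown strategy: {strategy}")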
2.4.3 Process and Reprocess
Process
The Process operation's goal is to recompute the alpha and gradient values after inserting the selected data point as a support vector.

The operation can be described in three main steps. The first step determines a data point to be processed as a potential support vector; the selected sample is used as a new training point for the model. The second step selects a corresponding support vector that forms a \tau-violating pair with the newly added sample, such that the pair has the maximal gradient. The third step updates the weight of every support vector.
Reprocess

The Reprocess operation removes from the support vector set elements whose alphas are equal to zero.

The operation first determines a \tau-violating pair with the maximal gradient; next it performs a direction search and recomputes the alpha weights. At the end of the computation, if any previous support vector now has an alpha weight of zero, it is removed from the set. The bias term and the new gradients are also recomputed at the end. A compact sketch follows.
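This sketch uses the signed-multiplier bookkeeping of [10] (per-point gradients g_k and box bounds [A_k, B_k]); the direction search is reduced to a single simplified step, bias recomputation is omitted, and all names are illustrative.

def reprocess(S, alpha, g, A, B, K, tau):
    # S: support vector indices; alpha, g: multipliers and gradients;
    # A, B: per-point box bounds; K: kernel matrix; tau: violation tolerance.
    ups = [k for k in S if alpha[k] < B[k]]    # can still increase
    downs = [k for k in S if alpha[k] > A[k]]  # can still decrease
    if not ups or not downs:
        return S
    i = max(ups, key=lambda k: g[k])           # maximal-gradient tau-violating pair
    j = min(downs, key=lambda k: g[k])
    if g[i] - g[j] <= tau:
        return S                               # no tau-violating pair left
    # Direction search: step limited by curvature and the box constraints.
    lam = min((g[i] - g[j]) / (K[i][i] + K[j][j] - 2 * K[i][j]),
              B[i] - alpha[i], alpha[j] - A[j])
    alpha[i] += lam
    alpha[j] -= lam
    for s in S:                                # keep the stored gradients consistent
        g[s] -= lam * (K[i][s] - K[j][s])
    # Drop points whose multiplier returned to zero; they are no longer SVs.
    return [s for s in S if alpha[s] != 0.0]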
2.4.4 Termination Condition
Each epoch of the online training terminates when all samples have been selected for the processing step. The current implementation also includes three types of termination criteria, based on the number of iterations, the number of support vectors, or the training time.
Figure 2-4 shows the pseudocode of LaSVM. The main loop starts by selecting 5 samples of each class as support vectors, then iterates through the data points in each epoch. The algorithm selects a fresh sample data point by its selection strategy and adds it to the support vector set. Next, the Reprocess operation attempts to remove non-support-vector points from the set and updates the bias term and gradient values. If the termination condition is met at the end of the loop, Reprocess is called repeatedly until there are zero \tau-violating pairs.
train_online() {
    select 5 examples of each class as support vectors;
    for (every epoch) {
        for (every point in training set) {
            lasvm_process();
            if (delta_max <= 1000) {
                lasvm_reprocess();
                while (error > delta_max)
                    lasvm_reprocess();
            }
        }
    }
    if (termination condition met) {
        repeat reprocess until gradient < tau;
    }
}

select() {
    case RANDOM:    // pick a random candidate
    case GRADIENT:  // pick best gradient from candidates
    case MARGIN:    // pick closest to margin from candidates
}

lasvm_process() {
    check if already in expansion as SV;
    select();
    compute gradient;
    insert;
    perform a direction search;
}

lasvm_reprocess() {
    find maximal gradient;
    perform a direction search;
    remove non-support vectors;
    compute new bias and gradient values;
}

Figure 2-4: LaSVM Pseudo code
Chapter 3
SVM Experiments
This chapter compares the performance of the offline SVM (SVMLight) and the online SVM (LaSVM) on five different datasets: MNIST handwritten digit classification, UCI's income prediction, webpage classification, face detection, and CEARCH entity classification. The experiments were run on a 3.20GHz Intel machine with 2GB of available memory.

The first four datasets are chosen from published papers, and they cover a wide range of input data formats and training problems. An additional entity classification dataset is compiled from a simulation tool developed by Northrop Grumman.
3.1 Experimental Setup
Dataset   Train Size   Test Size   Features   Gamma         Tradeoff c   Cache (MB)   Working Set (q)
MNIST     60000        10000       780        0.005         10           256          10
Income    32562        16282       123        0.005         1            80           2
Webpage   49749        38994       300        0.005         5            80           2
Face      6977         24045       363        0.01          10           80           20
CEARCH    18782        100         10         0.0001/0.01   10/7         100          10

Table 3.1: Dataset summary
MNIST Setup
The MNIST experiments compare the performance of SVMLight and LaSVM with one epoch and with two epochs. The associated parameters are set equal for a fair comparison. Besides algorithm-specific parameters, such as a working set size of 10 for SVMLight and the random selection scheme for LaSVM, both methods were tested with an RBF kernel with gamma=0.005 and tradeoff constant c=10, using a 256MB cache.
Income Prediction Setup
Similar to the MNIST digit classification experiments, the training parameters were set equal for both the SVMLight and LaSVM methods. All training runs use the RBF kernel with gamma=0.005, tradeoff constant c=1, and an 80MB cache.
Webpage Classification Setup
In the webpage classification experiment, the training parameters are the following: RBF kernel with gamma=0.005, tradeoff constant c=5, and a cache size of 80MB. The working set for SVMLight is chosen to be q=2, and LaSVM uses the random selection scheme for both the one- and two-epoch runs.
Face Detection Setup
All experiments on the face detection dataset use the same training parameters: an RBF kernel with gamma=0.01, tradeoff constant c=10, and an 80MB cache. The working set q is set to 20 for SVMLight.
CEARCH Setup
The Distributed Sensing Test Bed [12] developed by Northrop Grumman is a simulation tool for modeling two-dimensional entity movements observed by platforms with sensors. Sensor reports are generated whenever an entity moves within the specified sensor range with the appropriate sensor dimension. At the same timestamps, there are also truth reports, which log the true positions and velocities of all entities with no sensor noise.

The simulation tool takes in two XML input files, which specify the simulation parameters including the entity type, platform speed, sensor dimension, sensor range, and the region map. The entity, platform, and map are all implemented as C++ classes.
Figure 3-1: Simulation map
Entity - An entity can have one or more signature dimensions and a specified behavior. The signature values are Gaussian distributed with a mean and a standard deviation. The behavior can be random or follow specified trajectories along waypoints.

Platform - Each platform can have multiple sensors which detect any entities within the sensor range. A platform moves across the region with a random behavior or a commandable route, provided by the C2FusionAlgorithm class.

Map - The map specifies the world in which the platforms and entities operate. There are different map layers, including the infrastructure of roads and parking space, the building structure of houses and stores, and natural features like ponds.
The simulation used the exact scenario file provided by Northrop Grumman (scenario-cearch.xml), with an increased running time to obtain a larger dataset (cearch_1a_big). There are a total of 630 entities (50 types) and 20 platforms (10 types). The config file (dstb-cearch_1.xml) kept the entity information the same and increased each platform's sensor dimensions from 1 to 10. The goal is to have 10 simultaneous sensor reports from a single entity; all 10 reports are used as additional features for training the SVM models. The added sensors on a platform have the same sensor range and noise as the existing sensor on that platform. The data size associated with the signature, however, remained the same for the same signature dimension across all 10 platforms.

Two simulations were run with the same random seed, 5147. The cearch_1a simulation lasts for 10 time units, and the cearch_1a_big simulation lasts for 60.12 time units with 0.02 resolution.

Out of the 50 types of entities, only the first two have the behavior of going from houses to stores; the rest of the entity types have random behaviors. Entities of type 1 are actuated with a consistent move behavior, and entities of type 2 have inconsistent route behaviors. The platforms are commandable and have zig-zagging behaviors surveying the simulation area.
3.2 MNIST Dataset
The MNIST handwritten digit (0-9) dataset consists of 60,000 training samples and 10,000 testing samples. The features are 780 gray-level pixel values ranging from 0 to 1. In order to classify each digit, we train the SVM with that digit's class as +1 and the remaining nine classes as -1. The positive and negative samples of each digit are summarized in Table 3.2.
Label   train(+1)   train(-1)   test(+1)   test(-1)
0       5923        54077       980        9020
1       6742        53258       1135       8865
2       5958        54042       1032       8968
3       6131        53869       1010       8990
4       5842        54158       982        9018
5       5421        54579       892        9108
6       5918        54082       958        9042
7       6265        53735       1028       8972
8       5851        54149       974        9026
9       5949        54051       1009       8991

Table 3.2: MNIST dataset details
In Figure 3-2, across all 10 digits, LaSVM with one epoch takes the least training time. LaSVM with two epochs also trains faster than SVMLight.

In Figure 3-3, all three methods have comparable numbers of support vectors. Running LaSVM with one epoch returns the fewest support vectors overall. Among the digits, labels 0 and 1 were the easiest to train because of the few support vectors required, and labels 8 and 9 were the hardest. A similar relationship is also reflected in the training time.

In Figure 3-4, all three methods have comparable test errors among the 10 digits. Surprisingly, training LaSVM with two epochs does not return a significantly higher accuracy for the extra training time.
Figure 3-2: MNIST training time
Figure 3-3: MNIST number of support vectors
Figure 3-4: MNIST test error
In Figure 3-5, LaSVM with one epoch has the smallest training-time-to-support-vector ratio, and SVMLight has the highest.
Figure 3-5: MNIST training time vs number of support vectors
In Figure 3-6, reducing the available cache size significantly affects the training time of SVMLight because of the requirements of Quadratic Programming: the increased training time is spent recalculating the kernel values that no longer fit in the smaller cache. LaSVM, in contrast, tolerates caches as small as 16MB with much less slowdown.
Figure 3-6: MNIST training time vs cache size
Algorithm    Training time   Number of SVs   Accuracy
SVMLight     -               -               X
LaSVM(x1)    X               X               X
LaSVM(x2)    -               -               X

Table 3.3: MNIST summary (an X marks the top performer for each criterion)
3.3 Income Prediction
The goal of the income dataset is to predict whether a household has an income greater than $50,000 based on 123 binary features. The feature dimensions are derived from 14 attributes; 8 of the 14 are categorical binary values, and the rest are continuous attributes discretized into intervals. The training data were split into increasing sample sizes to analyze their effect on training performance. The details of the training and testing datasets are shown in Table 3.4.
Examples   train(+1)   train(-1)   test(+1)   test(-1)
1605       395         1210        3846       12436
2265       572         1693        3846       12436
3185       773         2412        3846       12436
4781       1188        3593        3846       12436
6414       1569        4845        3846       12436
11221      2692        8528        3846       12436
16101      3918        12182       3846       12436
22697      5506        17190       3846       12436
32562      7841        24721       3846       12436

Table 3.4: Income dataset details
The following figures show the results for training time, number of support vectors, and test accuracies. Across the different training sample sizes, LaSVM with one epoch trains faster than SVMLight except on the largest dataset, a9a, and LaSVM with two epochs trains the slowest of the three experiments (Figure 3-8). All three return similar numbers of support vectors (Figure 3-10). For the extra training time, LaSVM with two epochs did not improve the test accuracy and performed the worst. Overall, the three algorithms perform equally on training time, number of support vectors, and even accuracy. On the largest dataset, a9a, SVMLight has the fastest training time and a test accuracy similar to two-epoch LaSVM.
Examples      SVMLight   LaSVM(x1)   LaSVM(x2)
1605 (a1a)    0.47       0.26        0.52
2265 (a2a)    0.94       0.55        1.08
3185 (a3a)    1.75       1.10        2.09
4781 (a4a)    3.91       2.41        4.90
6414 (a5a)    7.07       4.65        9.34
11221 (a6a)   24.54      16.51       35.45
16101 (a7a)   57.38      41.72       85.43
22697 (a8a)   110.3      103.95      196.52
32562 (a9a)   227.96     248.59      442.24

Figure 3-7: Income training time comparison (seconds)

Figure 3-8: Training time vs sample size
Examples      SVMLight             LaSVM(x1)   LaSVM(x2)
1605 (a1a)    786 (bsv: 757)       767         784
2265 (a2a)    1105 (bsv: 1082)     1085        1102
3185 (a3a)    1450 (bsv: 1412)     1419        1443
4781 (a4a)    2090 (bsv: 2051)     2044        2081
6414 (a5a)    2710 (bsv: 2661)     2662        2700
11221 (a6a)   4478 (bsv: 4424)     4408        4467
16101 (a7a)   6322 (bsv: 6252)     6226        6304
22697 (a8a)   8731 (bsv: 8668)     8604        8707
32562 (a9a)   12162 (bsv: 12082)   12040       12142

Figure 3-9: Income number of support vectors

Figure 3-10: Number of support vectors vs sample size
Examples      SVMLight   LaSVM(x1)   LaSVM(x2)
1605 (a1a)    82.39%     83.04%      82.42%
2265 (a2a)    83.11%     83.54%      83.09%
3185 (a3a)    83.28%     83.58%      83.27%
4781 (a4a)    83.48%     83.59%      83.48%
6414 (a5a)    83.74%     83.74%      83.72%
11221 (a6a)   83.79%     84.14%      83.75%
16101 (a7a)   84.02%     84.24%      84.02%
22697 (a8a)   84.18%     84.35%      84.15%
32562 (a9a)   84.35%     84.26%      84.34%

Figure 3-11: Income test accuracies

Figure 3-12: Test accuracy vs sample size
Figure 3-14 also shows the direct relationship between kernel evaluations and training time.
Since the bulk of the SVM training depends on the kernel evaluations, the more kernel evaluations
an algorithm requires, the longer the training time tends to take.
Examples      SVMLight    LaSVM(x1)   LaSVM(x2)
1605 (a1a)    1009656     653448      1316311
2265 (a2a)    1993157     1270049     2632005
3185 (a3a)    3762116     2583480     4945027
4781 (a4a)    8231945     5091158     10796052
6414 (a5a)    14474588    9170704     19057417
11221 (a6a)   50842630    25503523    56208622
16101 (a7a)   109866178   56062076    120879722
22697 (a8a)   221425039   136418730   263881292
32562 (a9a)   453929799   310893163   563633998

Figure 3-13: Income kernel evaluations

Figure 3-14: Kernel evaluations vs sample size
Algorithm    Training time   Number of SVs   Accuracy
SVMLight     =               =               =
LaSVM(x1)    =               =               =
LaSVM(x2)    =               =               =

Table 3.5: Income summary (= denotes comparable performance among the three runs)
3.4 Webpage Classification
The webpage dataset, prepared by John Platt, is an experiment in text categorization. The goal is to train the SVM with the keywords extracted from a webpage and classify whether the webpage belongs to a certain category. There are 300 keyword features in a training set of 49749 webpages. Similar to the income prediction problem, the training samples are split into multiple training sizes, from w1a to w8a.
Examples      train(+1)   train(-1)   test(+1)   test(-1)
2477 (w1a)    72          2405        1094       37900
3470 (w2a)    107         3363        1094       37900
4912 (w3a)    143         4769        1094       37900
7366 (w4a)    216         7150        1094       37900
9888 (w5a)    281         9607        1094       37900
17188 (w6a)   525         16663       1094       37900
24692 (w7a)   740         23952       1094       37900
49749 (w8a)   1479        48270       1094       37900

Table 3.6: Webpage dataset details
The training times of the three approaches are shown in Figure 3-16: LaSVM with one epoch takes the least time, while two epochs take close to twice as much time, if not more. Interestingly, even though LaSVM trains fastest, it also has the highest number of support vectors (Figure 3-18).
Examples      SVMLight   LaSVM(x1)   LaSVM(x2)
2477 (w1a)    0.29       0.16        0.41
3470 (w2a)    0.53       0.29        0.74
4912 (w3a)    0.85       0.57        1.36
7366 (w4a)    1.79       1.12        2.79
9888 (w5a)    3.05       2.31        3.91
17188 (w6a)   9.31       5.87        14.32
24692 (w7a)   20.13      12.94       33.54
49749 (w8a)   80.12      72.87       188.90

Figure 3-15: Webpage training time comparison (seconds)

Figure 3-16: Training time vs sample size
Overall, SVMLight is more accurate and has fewer support vectors.
Examples      SVMLight           LaSVM(x1)   LaSVM(x2)
2477 (w1a)    224 (bsv: 102)     282         232
3470 (w2a)    300 (bsv: 160)     367         279
4912 (w3a)    360 (bsv: 220)     410         345
7366 (w4a)    506 (bsv: 349)     598         503
9888 (w5a)    628 (bsv: 460)     612         616
17188 (w6a)   1068 (bsv: 836)    1438        1046
24692 (w7a)   1433 (bsv: 1164)   2056        1458
49749 (w8a)   2510 (bsv: 2147)   3830        2836

Figure 3-17: Webpage number of support vectors

Figure 3-18: Number of support vectors vs sample size
Examples      SVMLight   LaSVM(x1)   LaSVM(x2)
2477 (w1a)    97.44%     97.45%      97.45%
3470 (w2a)    97.46%     97.54%      97.46%
4912 (w3a)    97.58%     97.70%      97.53%
7366 (w4a)    97.87%     97.88%      97.73%
9888 (w5a)    97.97%     97.90%      97.79%
17188 (w6a)   98.34%     98.35%      98.09%
24692 (w7a)   98.48%     98.33%      98.16%
49749 (w8a)   98.79%     98.52%      98.41%

Figure 3-19: Webpage test accuracies

Figure 3-20: Test accuracy vs sample size
Examples      SVMLight    LaSVM(x1)   LaSVM(x2)
2477 (w1a)    642643      456728      1225663
3470 (w2a)    1183361     870890      2186374
4912 (w3a)    2002387     1653750     4100866
7366 (w4a)    4238842     3083272     7990208
9888 (w5a)    7095305     6238597     10574545
17188 (w6a)   21019709    14728017    37232710
24692 (w7a)   45002780    31948290    82493882
49749 (w8a)   172741988   104253875   314426160

Figure 3-21: Webpage kernel evaluations

Figure 3-22: Kernel evaluations vs sample size
Algorithm    Training time   Number of SVs   Accuracy
SVMLight     -               X               X
LaSVM(x1)    X               -               X
LaSVM(x2)    -               X               -

Table 3.7: Webpage summary (an X marks the top performer for each criterion)
3.5 Face Detection
The goal of this experiment is to classify whether or not an image contains a face. The features are the 361 pixels of the 19x19 images, with values ranging from 0 to 1. The SVM models were trained on each dataset and evaluated on the same testing samples (Table 3.8).
Examples       train(+1)   train(-1)   test(+1)   test(-1)
435 (face0)    151         284         472        23573
871 (face1)    303         568         472        23573
1744 (face2)   607         1137        472        23573
3488 (face3)   1214        2274        472        23573
6977 (face4)   2429        4548        472        23573

Table 3.8: Face detection dataset details
LaSVM with one epoch consistently has the least training time, and SVMLight lies in between (Figure 3-24). As can be seen in the following figures, the one-epoch LaSVM also has the smallest number of support vectors on average, yet with performance comparable to SVMLight. In this example, the two-epoch version again gains no improvement in the test accuracies.

An interesting observation can be made by comparing Figures 3-24 and 3-30: even though SVMLight has kernel-evaluation counts similar to the two-epoch LaSVM, its training time is significantly lower. This may be due to the efficiency of its large working set of 20.
Examples       SVMLight   LaSVM(x1)   LaSVM(x2)
435 (face0)    0.14       0.07        0.14
871 (face1)    0.41       0.23        0.54
1744 (face2)   1.01       0.69        1.65
3488 (face3)   2.95       2.51        6.03
6977 (face4)   8.14       7.88        19.36

Figure 3-23: Face detection training time comparison (seconds)

Figure 3-24: Training time vs sample size
Examples       SVMLight        LaSVM(x1)   LaSVM(x2)
435 (face0)    99 (bsv: 3)     97          99
871 (face1)    156 (bsv: 7)    143         155
1744 (face2)   208 (bsv: 9)    189         206
3488 (face3)   313 (bsv: 30)   288         307
6977 (face4)   449 (bsv: 47)   409         44

Figure 3-25: Face detection number of support vectors

Figure 3-26: Number of support vectors vs sample size
Examples       SVMLight   LaSVM(x1)   LaSVM(x2)
435 (face0)    95.91%     95.90%      95.91%
871 (face1)    96.82%     96.80%      96.80%
1744 (face2)   97.74%     97.74%      97.74%
3488 (face3)   98.08%     98.07%      98.08%
6977 (face4)   98.29%     98.34%      98.29%

Figure 3-27: Face detection: test accuracies

Figure 3-28: Test accuracy vs sample size
Examples       SVMLight   LaSVM(x1)   LaSVM(x2)
435 (face0)    86142      32953       62166
871 (face1)    242288     94665       208801
1744 (face2)   589349     246205      559261
3488 (face3)   1701520    732452      1665282
6977 (face4)   4644224    2035821     4892865

Figure 3-29: Face detection: kernel evaluations

Figure 3-30: Kernel evaluations vs sample size
Algorithm    Training time   Number of SVs   Accuracy
SVMLight     -               -               X
LaSVM(x1)    X               X               X
LaSVM(x2)    -               -               X

Table 3.9: Face detection summary (an X marks the top performer for each criterion)
3.6 CEARCH Entity Classification
In order to determine the optimal parameters for training both algorithms, a series of experiments was run on a smaller dataset of size 4060 and a testing dataset of 16, with 8 samples of each class. After the parameter tuning process, the algorithms were evaluated across multiple datasets of increasing sizes, from 2000 to 18782 (Table 3.10).
Examples   train(+1)   train(-1)   test(+1)   test(-1)
2000       761         1239        50         50
4000       761         3239        50         50
6000       761         5239        50         50
8000       761         7239        50         50
10000      761         9239        50         50
15000      761         14239       50         50
18782      761         18021       50         50

Table 3.10: CEARCH dataset details
The initial training parameters are an RBF kernel with a tradeoff constant of 10 and a 100MB cache. Due to the small number of positive samples in our dataset, we initialized the weight of the positive class to the ratio between the negative and positive samples, 66.7. The working set size q was set to 2.
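Under SVMLight's stock command line, such a positive-class weight is passed with the -j cost factor; an illustrative invocation with the tuned CEARCH parameters (file names hypothetical) would be:

svm_learn -t 2 -g 0.0001 -c 10 -j 66.7 -m 100 -q 2 cearch_train.dat cearch.model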
Figure 3-31 shows the test accuracies as the value of gamma is varied. In the case of SVMLight, the smaller the gamma value, the higher the test accuracy; hence a low gamma of 0.0001 is chosen for the subsequent training. Surprisingly, LaSVM shows a slightly different result, with the maximum test accuracy at gamma=0.01. Overall, the average LaSVM test accuracy is significantly lower than SVMLight's, by more than 10%.
After tuning the kernel parameter (gamma), we continued with experiments on the tradeoff constant c (Figure 3-32). SVMLight returns a high accuracy of 93.75% for c values within the range of 2 to 6, while LaSVM returns its maximal accuracy at c=7. Again, LaSVM returns a much lower testing performance overall.
Figure 3-33 confirms our choice of weight for the positive class, as both SVMLight and LaSVM return reasonably high accuracies for w=66.7 compared with other weight values.
Figure 3-31: Test accuracy vs gamma
Figure 3-32: Test accuracy vs c
Figure 3-33: Test accuracy vs weight
After determining the optimal parameter values for gamma, tradeoff c, and the weight of the positive class, the final experiments focus on performance across different sizes of training datasets. While SVMLight improves its accuracy given more training data, LaSVM's performance degrades, the opposite of its behavior on all the previous datasets. We do not know the exact reason; one speculation is that the parameters were chosen to fit the smaller dataset and are not tailored for larger sizes, though more experiments would be needed for a clear explanation.
Examples   SVMLight   LaSVM(x1)
2000       1.34       0.57
4000       4.86       1.82
6000       9.82       3.54
8000       17.35      4.98
10000      183.67     6.94
15000      249.23     9.86
18782      256.32     13.172

Figure 3-34: CEARCH training time comparison (seconds)

Figure 3-35: Training time vs sample size
Examples   SVMLight   LaSVM(x1)
2000       961        955
4000       2175       1232
6000       3274       1482
8000       4235       1633
10000      5192       1767
15000      7265       1957
18782      8837       2066

Figure 3-36: CEARCH number of support vectors

Figure 3-37: Number of support vectors vs sample size
Examples   SVMLight   LaSVM(x1)
2000       75%        76%
4000       78%        68%
6000       79%        69%
8000       74%        60%
10000      76%        59%
15000      85%        55%
18782      87%        54%

Figure 3-38: CEARCH test accuracies

Figure 3-39: Test accuracy vs sample size
Algorithm    Training time   Number of SVs   Accuracy
SVMLight     -               -               X
LaSVM(x1)    X               X               -
LaSVM(x2)    -               -               -

Table 3.11: CEARCH entity summary (an X marks the top performer for each criterion)
Chapter 4
Performance and Tradeoff
4.1 Cross-sectional Dataset Comparison
SVMLight
Dataset   Training Time   Kernel Counts   SV      BSV     Accuracy (%)
MNIST     660.69          1.6E+08         2086    804     99.55
Income    227.96          4.54E+08        12162   12082   84.35
Webpage   80.12           1.73E+08        2510    2147    98.79
Face      8.14            4.64E+06        449     47      98.29
CEARCH    198.35          2.1E+08         9147    9097    85.00

Table 4.1: SVMLight performance summary
LaSVM
Dataset   Training Time   Kernel Counts   SV      Accuracy (%)
MNIST     171.51          6.94E+07        1811    99.57
Income    248.59          3.11E+08        12040   84.26
Webpage   72.87           1.04E+08        3830    98.52
Face      7.88            2.04E+06        409     98.34
CEARCH    12.83           2.54E+07        2056    53.10

Table 4.2: LaSVM performance summary
Dataset   Training time           Number of SVs           Accuracy
MNIST     LaSVM(x1)               LaSVM(x1)               All
Income    SVMLight or LaSVM(x1)   All                     All
Webpage   LaSVM(x1)               SVMLight or LaSVM(x2)   SVMLight or LaSVM(x1)
Face      LaSVM(x1)               LaSVM(x1)               All
CEARCH    LaSVM(x1)               LaSVM(x1)               SVMLight

Table 4.3: Top performing algorithms for each dataset
One interesting observation is the relatively fast training time on both the Webpage and Face datasets; this may well be due to their large tradeoff constants c.
4.2 Procedure Count Analysis

In order to compare the two SVM algorithms closely on the exact procedures run during training, we used a program called proccount to record the total number of calls and instructions. The SVM models were trained on the income prediction dataset using the same training parameters (RBF kernel with gamma=0.005, c=1, cache=80MB, and q=2).
Figure 4-1: Top 20 processes of SVMLight (kernel-evaluation routines account for 76.92% of instructions)

Figure 4-2: Top 20 processes of LaSVM (kernel-evaluation routines account for 80.37% of instructions)
Among the top 20 procedures of both SVMLight and LaSVM, 70 to 80 percent of the instructions are kernel evaluations (Figures 4-1 and 4-2). Overall, LaSVM also has far fewer calls and instructions than SVMLight, which reflects a characteristic of the LaSVM algorithm: it solves for only 2 points at any given time, rather than solving the entire QP problem like SVMLight.
4.3 Sensitivity to Tolerance Parameter ε

The performance of the experiments on the 5 datasets is affected by different parameters of the algorithms. In order to verify the robustness of an algorithm to varying parameters, the following section explores the algorithms' sensitivity to the tolerance parameter ε. See Section 2.2 for the role of the termination condition ε in both the offline and online SVM algorithms.
Below is an experiment varying the termination condition for both LaSVM and SVMLight; this section describes how the accuracy changes. The dataset is the Income dataset, and the original experiments used tol=0.001.
Figure 4-3 shows the effect of varying tolerance (tol = 0.1 and 0.01) on one-epoch LaSVM. The accuracies of one-epoch LaSVM have a large variance, while the two-epoch LaSVM in Figure 4-4 stays consistent with the default tolerance (tol = 0.001).
Figure 4-5 shows the effect of varying tolerance (tol = 0.1 and 0.01) on SVMLight. The results stay consistent with SVMLight at the default tolerance (tol = 0.001).
Figure 4-3: Income: Accuracy vs sample size (one-epoch LaSVM at tol = 0.001, 0.1, and 0.01, compared with SVMLight at tol = 0.001)

Figure 4-4: Income: Accuracy vs sample size (two-epoch LaSVM at tol = 0.001, 0.1, and 0.01)
Figure 4-5: Income: Accuracy vs sample size (SVMLight at tol = 0.001, 0.1, and 0.01)

The tolerance experiments show that SVMLight may be more robust to changes in the tolerance setting, while one-epoch LaSVM performs less consistently. Compared to the one-epoch LaSVM, the two-epoch trials generate more stable test accuracies, similar to SVMLight. One interesting aspect worth mentioning is that one-epoch LaSVM still returns the highest accuracies for all three tolerance values (tol = 0.1, 0.01 and 0.001).
Chapter 5
Summary
This thesis compares the strengths and weaknesses of the offline and online SVM. The work includes performance comparisons of SVMLight and LaSVM, with results for training time, number of support vectors, kernel evaluations, and test accuracies. Overall, the online LaSVM trained in less time and returned test accuracies comparable to SVMLight's. The offline SVMLight, however, is more robust to a varying tolerance parameter than one-epoch LaSVM.
5.1 Future Work
The current experiments only include the random selection scheme for the LaSVM algorithm. The alternative schemes, gradient and active selection, may provide different training time and accuracy tradeoffs across the multiple datasets.

The current online LaSVM algorithm does not offer an approach to unsupervised training, which, combined with its fast training speed, could be an added strength. A potential extension could apply the model to realtime feedback adjustment, allowing it to change in response to new incoming data.
Bibliography
[1] C. Bahlmann, B. Haasdonk, and H. Burkhardt. On-line Handwriting Recognition with Support Vector Machines: A Kernel Approach. In Proc. of the 8th IWFHR, pages 49-54, 2002.

[2] C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 1998.

[3] I. Guyon and A. Elisseeff. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 2003.

[4] T. Joachims. Making Large-Scale SVM Learning Practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods, MIT Press, Cambridge, 1998.

[5] T. Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In European Conference on Machine Learning (ECML), 1998.

[6] E. Osuna, R. Freund, and F. Girosi. Support Vector Machines: Training and Applications. A.I. Memo 1602, MIT A.I. Lab, 1997.

[7] G. Cauwenberghs and T. Poggio. Incremental and Decremental Support Vector Machine Learning. In Advances in Neural Information Processing Systems, 2001.

[8] R. Vanderbei. LOQO: An Interior Point Code for Quadratic Programming. Technical Report SOR 94-15, Princeton University, 1994.

[9] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[10] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast Kernel Classifiers with Online and Active Learning. Journal of Machine Learning Research, 2005.

[11] J. Platt. Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods, MIT Press, Cambridge, 1998.

[12] Distributed Sensing Test Bed (DSTB) User's/Developer's Guide.

[13] C. Cortes and V. Vapnik. Support Vector Networks. Machine Learning, 20:273-297, 1995.

[14] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature Selection for SVMs.

[15] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines. Technical report, Computer Science and Information Engineering, National Taiwan University, 2001-2004.