Concept Boundary Detection for Speeding up SVMs

Navneet Panda    panda@cs.ucsb.edu
Dept. of Computer Science, UCSB, CA 93106, USA

Edward Y. Chang    echang@ece.ucsb.edu
Dept. of Electrical and Computer Engg., UCSB, CA 93106, USA

Gang Wu    gwu@engineering.ucsb.edu
Dept. of Electrical and Computer Engg., UCSB, CA 93106, USA

Appearing in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006. Copyright 2006 by the author(s)/owner(s).
Abstract

Support Vector Machines (SVMs) suffer from an O(n²) training cost, where n denotes the number of training instances. In this paper, we propose an algorithm to select boundary instances as training data to substantially reduce n. Our proposed algorithm is motivated by the result of (Burges, 1999) that removing non-support vectors from the training set does not change the SVM training result. Our algorithm eliminates instances that are likely to be non-support vectors. In the concept-independent preprocessing step of our algorithm, we prepare nearest-neighbor lists for the training instances. In the concept-specific sampling step, we can then effectively select useful training data for each target concept. Empirical studies show our algorithm to be effective in reducing n, outperforming other competing downsampling algorithms without significantly compromising testing accuracy.
1. Introduction
Support Vector Machines (SVMs) (Vapnik, 1995) are
a core machine learning technology. They enjoy strong
theoretical foundations and excellent empirical successes in many pattern-recognition applications. Unfortunately, SVMs do not scale well with respect to the size of the training data. Given n training instances, the time to train an SVM model is about O(n²). Consider applications such as intrusion detection, video surveillance, and spam filtering, where a classifier must be trained quickly for a new target concept, and on a large set of training data. The O(n²) training time can be excessive, rendering SVMs an impractical solution. Even when training can be performed offline, when the amount of training data and the number of target classes are large (e.g., the number of document/image/video classes on desktops or on the Web), the O(n²) computational complexity is not acceptable.
The goal of this work is to develop techniques that reduce n to speed up concept learning without degrading accuracy. (We contrast our approach with others' in Section 2.) We propose a technique to identify instances close to the boundary between classes. (Burges, 1999) shows that if all non-support vectors are removed from the training set, the training result is exactly the same as that obtained using the whole training set. Our boundary-detection algorithm aims to eliminate the instances most likely to be non-support vectors. Using only the boundary instances can thus substantially reduce the training-data size and, at the same time, achieve our accuracy objective.
Our boundary-detection algorithm consists of two steps: concept-independent preprocessing and concept-specific sampling. The concept-independent preprocessing step identifies the neighbors of each instance. This step incurs a one-time cost, at worst O(n²), and can absorb insertions and deletions of instances without the need to reprocess the training dataset. Once the preprocessing has been completed, the concept-specific sampling step prepares training data for each target concept. For a given concept, this step determines its boundary instances, reducing the training data size substantially. Empirical studies show that our boundary-detection algorithm can significantly reduce training time without significantly affecting class-prediction accuracy.
2. Related Work
Prior work in speeding up SVMs can be categorized
into two approaches: data-processing and algorithmic.
The data-processing approach focuses on training-data
selection to reduce n. The algorithmic approach devises algorithms to make the QP solver faster (e.g.,
(Osuna et al., 1997; Joachims, 1998; Platt, 1998;
Chang & Lin, 2001; Fine & Scheinberg, 2001)). Our
approach falls in the category of data processing, and
hence we discuss related work only in this category.
To reduce n, one straightforward data-processing
method is to randomly down-sample the training
dataset. Bagging (Breiman, 1996) is a representative sampling method, which enjoys strong theoretical
backing and empirical success. However, the random
sampling of bagging may not select the most effective training subset for each of its bags. Furthermore,
achieving high testing accuracy usually requires a large
number of bags. Such a strategy may not be very productive in reducing training time.
The cascade SVM (Graf et al., 2005) is another down-sampling method. It randomly divides the training data into a number of subsets, each of which is used to train one SVM. Instead of training only once, the cascade SVM uses a hierarchical training structure. At each level of the hierarchy, the training data are the support vectors obtained at the previous level. The same strategy is repeated until the final level (with only one SVM classifier) is reached. The support vectors obtained at the final level are then fed back to the leaf nodes, completing the cycle. The advantage of the cascade SVM over the bagging method is that it may yield fewer support vectors at the end of training, favorably impacting the classification speed. Both the cascade SVM and bagging can benefit from parallelization, but so can aspects of the approach developed in this paper.
Instead of performing random sampling on the training set, some "intelligent" sampling techniques have been proposed to perform down-sampling (Smola & Schölkopf, 2000; Pavlov et al., 2000; Tresp, 2001). Recently, (Yu et al., 2003) used a hierarchical micro-clustering technique to capture the training instances that are close to the decision boundary. However, since the hierarchical micro-clusters are not isomorphic to the high-dimensional feature space, such a strategy can be used only with a linear kernel function (Yu et al., 2003). Another method, the Informative Vector Machine (IVM) (Lawrence et al., 2003), attempts to address the issue by using posterior probability estimates of the reduction in entropy to choose a subset of instances for training.
Our boundary-detection algorithm complements the
algorithmic approach (such as SMO) for speeding up
SVMs. Comparing our approach to representative data-processing algorithms, we show empirically that our boundary-detection algorithm achieves markedly better speedup than both bagging and IVM.
3. Boundary Identification
Our proposed algorithm comprises two steps: concept-independent preprocessing and concept-specific sampling. The first step is concept-independent, thus allowing different concepts to be learned at a later stage.
The second step is concerned with the actual selection
of a relevant subset of the instances once the concept
of interest has been identified. Following the selection
of the subset, the relevant learning algorithm (SVMs)
is applied to obtain the classifier.
We assume the availability of a large corpus of training instances for preprocessing. This does not preclude
the addition or removal of instances from the dataset,
as we explain at the end of Section 3.1. The same
dataset is used for learning multiple concepts defined
by varying labels. For each concept, a classifier must
be learned to classify unseen instances. Thus, if we use
SVMs as the classifier algorithm, multiple hyperplanes
would need to be learned, one for each class. Essentially, the second step dictates the reuse of the training
dataset with varying labels. Such a scenario is common in data repositories where the data is available for
preprocessing with periodic updates adding/removing
instances.
3.1. Concept-independent Preprocessing
The first preprocessing step is performed only once for
the entire dataset. This concept-independent stage determines the set of nearest neighbors of each instance.
At the end of this stage, we have a data structure
containing the indexes and distances of the k nearest
neighbors of every instance in the dataset.
Computing the nearest neighbors of every instance in
the dataset can be an expensive operation for large
datasets. Given n training instances, naive determination of nearest neighbors takes O(n²) time. However, if only an approximate set of k nearest neighbors will suffice, relatively efficient algorithms are available in the database literature. One such approach, locality-sensitive hashing (LSH) (Gionis et al., 1999), has been shown to be especially useful in obtaining approximate nearest neighbors with high accuracy.¹ Recent work by (Liu et al., 2005) has shown vast improvements in performance over LSH. We reiterate that any such approximate-NN approach may be used to determine the neighborhood lists of all instances.

¹ Increasing the number of hash functions considerably reduces the error rate. Though the computational cost increases with the number of hash functions, this cost is much lower than that of the brute-force algorithm. Our experiments did not show a significant difference in overall accuracy using LSH.

More specifically, a family of functions is called locality-sensitive if, for any two instances, the probability of collision decreases as the distance between them increases. In essence, LSH uses multiple hash functions, each using a subset of the features, to map instances to buckets (a similar idea is used in (Achlioptas et al., 2002) to approximate inner products between instances). For every instance x_i whose approximate nearest neighbors are sought, LSH uses the same hash functions to map the instance to buckets. All instances in the dataset mapping to the same buckets are then examined for their proximity to x_i. Interested readers are referred to (Gionis et al., 1999) for details on LSH. The total cost of determining the neighborhood lists for all n instances is O(n m), where m is the average number of instances mapping into the same bucket, and typically m << n. The structure containing the k approximate nearest neighbors of each instance is obtained by examining the buckets to which each instance hashes. Since this stage is label-independent, its cost is a one-time expense for the entire dataset.
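As a concrete illustration of this preprocessing step, below is a minimal sketch of a p-stable LSH index for squared-Euclidean neighbors. It is simplified (each table uses a single random projection, whereas E2LSH concatenates several per table), and the class name and interface are our assumptions; any approximate-NN method could be substituted:

import numpy as np
from collections import defaultdict

class EuclideanLSH:
    def __init__(self, dim, n_tables=10, width=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(n_tables, dim))      # projection directions
        self.b = rng.uniform(0, width, size=n_tables)  # random offsets
        self.width = width
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _keys(self, x):
        # Bucket id per table: quantized, shifted projection of x.
        return np.floor((self.a @ x + self.b) / self.width).astype(int)

    def index(self, X):
        self.X = X
        for i, x in enumerate(X):
            for t, key in enumerate(self._keys(x)):
                self.tables[t][key].append(i)

    def neighbors(self, i, k):
        # Examine only instances sharing a bucket with instance i.
        cand = {j for t, key in enumerate(self._keys(self.X[i]))
                for j in self.tables[t][key] if j != i}
        cand = np.fromiter(cand, dtype=int, count=len(cand))
        sqd = ((self.X[cand] - self.X[i]) ** 2).sum(axis=1)
        order = np.argsort(sqd)[:k]
        return cand[order], sqd[order]  # neighbor indices, squared distances

Calling index(X) once and then neighbors(i, k=100) for every i yields the neighborhood lists; only bucket-mates are examined, which is the O(n m) behavior described above.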
3.2. Concept-specific Sampling

The second step is the concept-dependent stage where, given the class labels of the instances, a subset of the instances is selected to be used as input for the learning algorithm. We adopt the idea of continuous weighting of instances, based on their proximity to the boundary, to develop a scoring function for instances. The objective of the scoring function is to accord higher scores to instances closer to the boundary between the positive and negative classes.

Given the training data vectors {x_1, ..., x_n} in a space X ⊆ R^d, for a specific target concept we also know the labels {y_1, ..., y_n} of the training instances, where y_i ∈ {−1, 1}. Let x_i and x_j be instances from different classes (i.e., y_i ≠ y_j). Let kNN(x_j) be the top-k nearest-neighbor list of instance x_j. When we specifically refer to the k-th nearest neighbor of x_j, we use the notation kNN_k(x_j). When we write x_i ∈ kNN(x_j), we say that instance x_i is on the top-k neighborhood list of x_j. Let c(x_i, x_j) denote the score accorded to x_i by x_j, and let τ_j be the square of the distance from x_j to the closest instance of the opposite class on its neighborhood list. Our proposed scoring function is given by

    c(x_i, x_j) = exp(−(‖x_i − x_j‖₂² − τ_j) / γ).        (1)

The parameter γ of the exponentially decaying scoring function is the mean of ‖x_i − x_j‖₂² − τ_j. The exponential decay is just one possible choice for the scoring function and is used because of the ease of parameter determination. The score accorded to x_i by x_j has the maximum value, 1, when x_i is the first instance of the opposite class in the neighborhood list of x_j. The score decays exponentially as the distance from x_j increases. Figure 1 shows the variation of the scores with increasing distance from the instance: the peak occurs at the distance of the nearest neighbor from the opposite class, and the score decreases continuously thereafter.

Figure 1. Plot of the scoring function (score versus distance; the score peaks at 1 at distance τ_j).

To obtain the cumulative score of x_i, we sum over the contributions of all oppositely labeled instances with x_i in their neighborhood lists. Normalization by the number of contributors (#x_i) is necessary to avoid issues arising from variations in the density of data instances. The normalized score, S_xi, is given by

    S_xi = (1 / #x_i) Σ_{x_j s.t. x_i ∈ kNN(x_j)} c(x_i, x_j),        (2)

where #x_i is the number of instances x_j with x_i ∈ kNN(x_j) and y_i ≠ y_j.
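To make the scoring concrete, consider a hypothetical instance x_j whose nearest opposite-class neighbor lies at squared distance τ_j = 1, and suppose γ = 2. Opposite-class instances on x_j's list at squared distances 1, 2, and 3 then receive c = exp(0) = 1, c = exp(−0.5) ≈ 0.61, and c = exp(−1) ≈ 0.37, respectively. Each contribution is accumulated into the receiving instance's sum and later divided by its contributor count #x_i, as in Eq. (2). (The numbers are illustrative only.)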
Insertions and Deletions: Insertion of a new instance into the dataset involves computing the buckets to which it hashes. The neighborhood list of the new instance is formed by examining the proximity of the instances in the dataset mapping to the same buckets. Moreover, only the neighborhood lists of the instances in the mapped buckets may change upon addition of the new instance. Deletion of instances is noted by setting a flag.
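As an illustration of the insertion path, the following continues the hypothetical EuclideanLSH sketch from Section 3.1 (the function name, the dict-of-lists nn_lists, and the full refresh of affected lists are our assumptions, not the paper's implementation):

import numpy as np

def insert_instance(lsh, nn_lists, x_new, k=100):
    # Hash the new instance into every table; only instances sharing one
    # of its buckets can have their neighborhood lists change.
    i_new = len(lsh.X)
    lsh.X = np.vstack([lsh.X, x_new])
    affected = set()
    for t, key in enumerate(lsh._keys(x_new)):
        affected.update(lsh.tables[t][key])
        lsh.tables[t][key].append(i_new)
    nn_lists[i_new] = lsh.neighbors(i_new, k)   # list for the new instance
    for j in affected:                          # refresh only these lists
        nn_lists[j] = lsh.neighbors(j, k)
    # Deletion, per the text, would just set a flag on the instance.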
Notation
n          = # of instances in the dataset
k          = # of nearest neighbors
S_xi       = normalized score of instance x_i
#x_i       = # of contributors to the score of instance x_i
kNN_k(x_i) = k-th nearest neighbor of instance x_i
D^j_xi     = squared distance between x_i and kNN_j(x_i)

Figure 2. Functioning of the Scoring Function (positive instances lie within the solid curve; negative instances lie outside).

Let us use the example in Figure 2 to illustrate how the concept-specific step works. In the figure, the positive instances are enclosed within the solid curve, and the negative instances lie outside. For the purpose of demonstration, we pick three of the positive instances x₁⁺, x₂⁺, x₃⁺ and three of the negative instances x₁⁻, x₂⁻, x₃⁻. These instances are placed at varying distances from the boundary and from each other. Analyzing the scores accorded to the positive instances x₁⁺, x₂⁺, x₃⁺ by the negative instance x₁⁻, we see that the maximum increment in their respective scores is received by the nearest neighbor, x₃⁺, followed by x₂⁺ and then x₁⁺. Considering only instances of the opposite class in the neighborhood list, instances towards the top of the list tend to enjoy higher normalized scores. Since the objective is to prune out instances far from the boundary, selecting instances based on their normalized scores helps us obtain the desired subset. Similar reasoning applies to the higher scores of x₁⁻ and x₂⁻ when compared with the score of x₃⁻.
The process of scoring involves every instance examining its neighborhood list for instances of the opposite class and then computing scores. Since the distances to the nearest neighbors have also been stored in the neighborhood lists, the values of τ_j and ‖x_i − x_j‖² are already available. Figure 3 presents the steps of the scoring algorithm for reference.
procedure Determine Scores
  Input:  n, k, X, y
  Output: S
  /* Determine the exponential-decay parameter γ */
  γ = 0
  counter = 0
  for i = 1 to n
    nearest-opposite-neighbor-found = false
    for j = 1 to k
      if y_i ≠ y_{kNN_j(x_i)}
        if not nearest-opposite-neighbor-found
          nearest-opposite-neighbor-found = true
          τ_i = D^j_xi
        γ = γ + (D^j_xi − τ_i)
        counter = counter + 1
  γ = γ / counter
  /* Determine the scores of the instances */
  for i = 1 to n
    nearest-opposite-neighbor-found = false
    for j = 1 to k
      if y_i ≠ y_{kNN_j(x_i)}
        if not nearest-opposite-neighbor-found
          nearest-opposite-neighbor-found = true
          τ_i = D^j_xi
        S_{kNN_j(x_i)} += exp(−(D^j_xi − τ_i) / γ)
        #_{kNN_j(x_i)} += 1
  for i = 1 to n
    S_xi = S_xi / #x_i
  return S

Figure 3. Computation of Scores
As can be seen in Figure 3, the outer loop iterates over all instances in the dataset and the inner loop over the nearest neighbors of each instance. Since the number of instances on each neighborhood list is k, the total cost of the operation is O(n k). Since k is a small constant relative to n, this is linear in the number of instances in the dataset. The obtained scores of the instances are then sorted, and a subset of the training instances is selected with a preference for instances with higher scores. Sorting the instances by score adds a cost of O(n log n), so this stage takes O(n k + n log n) time.
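For concreteness, the following is a minimal NumPy rendering of the procedure in Figure 3. The array encoding of the neighborhood lists (nn_idx for neighbor indices, nn_sqdist for squared distances, both of shape n × k) and the guard for instances with no contributors are our assumptions, not part of the original pseudocode:

import numpy as np

def determine_scores(nn_idx, nn_sqdist, y):
    n, k = nn_idx.shape
    # Pass 1: gamma = mean of (D^j_xi - tau_i) over all opposite-class
    # entries on the neighborhood lists, where tau_i is the squared
    # distance to the first opposite-class neighbor of instance i.
    tau = np.full(n, np.inf)
    total, counter = 0.0, 0
    for i in range(n):
        for j in range(k):
            if y[nn_idx[i, j]] != y[i]:
                if np.isinf(tau[i]):
                    tau[i] = nn_sqdist[i, j]
                total += nn_sqdist[i, j] - tau[i]
                counter += 1
    gamma = total / counter

    # Pass 2: each instance accords a score (Eq. 1) to every opposite-class
    # neighbor on its list; accumulate, then normalize (Eq. 2).
    S = np.zeros(n)
    contributors = np.zeros(n, dtype=int)
    for i in range(n):
        for j in range(k):
            nb = nn_idx[i, j]
            if y[nb] != y[i]:
                S[nb] += np.exp(-(nn_sqdist[i, j] - tau[i]) / gamma)
                contributors[nb] += 1
    mask = contributors > 0
    S[mask] /= contributors[mask]           # normalized scores S_xi
    return S

With the lists precomputed, the two passes cost O(n k), matching the analysis above.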
Note that the neighborhood lists have already been constructed at the end of the concept-independent preprocessing step (performed only once for the entire dataset) and do not need to be reconstructed on a per-concept basis. Thus, the cost of constructing the neighborhood lists does not come into play at the concept-learning stage and is amortized over all concepts in the dataset. These concepts include the concepts already formulated and the concepts which may be formulated using the dataset instances in the future. The underlying assumption is that the number of concepts in the dataset, L, is large. The amortized cost of forming the neighborhood lists is O(n² log n / L) using the naive algorithm and O((n m + n m log m) / L) using LSH.

The scoring algorithm devised above is partial to outliers in that outliers almost certainly receive high scores. However, indications of an instance being an outlier can be obtained by examining its neighborhood list. For example, in a balanced dataset, an instance whose list is made up of more than 95% instances from the opposite class is, with high probability, an outlier that can be removed from consideration.
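The 95% heuristic can be applied directly to the stored neighborhood lists; a small sketch (the function name and default threshold are ours, mirroring the example in the text):

def is_probable_outlier(nn_idx, y, i, threshold=0.95):
    # Fraction of instance i's neighborhood list belonging to the
    # opposite class; above the threshold, flag i as a likely outlier.
    neighbors = nn_idx[i]
    opposite = sum(1 for nb in neighbors if y[nb] != y[i])
    return opposite / len(neighbors) > threshold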
3.3. Selection of Subset Size

Having obtained the scores of the instances in the dataset, we now need to prune out instances based on their scores. We outline two strategies for choosing the number of instances to retain.

The first strategy leaves the choice of the number of instances to the user. Given the computational resources available, a user may choose the number of instances that form the subset. Assuming a worst-case behavior of the learning algorithm (in the case of SMO, the worst-case behavior has been empirically demonstrated to be O(n^2.3) (Platt, 1998)), the number of instances forming the subset can be suitably picked. The actual instances are then chosen by picking the required number of instances with the highest scores.

The second strategy uses the distribution of the scores of the instances to arrive at the size of the subset. We first obtain the sum, G, of the scores of all instances in the dataset (G = Σ_{i=1}^{n} S_xi). Then, starting with the instance with the highest score, we keep picking instances until the sum of the scores of the selected instances reaches a pre-specified percentage of G, the goal being to select the instances contributing most significantly to the overall score. In the section evaluating the performance of our approach (Section 4), we present performance measures with multiple percentage choices.
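A sketch of the second strategy, assuming the normalized scores are held in a NumPy array S (the function name and interface are ours):

import numpy as np

def select_by_score_mass(S, fraction=0.8):
    # Sort scores in descending order and keep the smallest prefix whose
    # cumulative score reaches `fraction` of the total G.
    order = np.argsort(S)[::-1]
    cumulative = np.cumsum(S[order])
    G = cumulative[-1]
    m = int(np.searchsorted(cumulative, fraction * G)) + 1
    return order[:m]            # indices of the selected instances

The first strategy corresponds to simply taking order[:m] for a user-chosen m.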
The functioning of the boundary-detection algorithm is illustrated on three toy datasets in Figures 4, 5, and 6. In each case, the first plot shows the distributions of the positive and negative instances (indicated by separate colors) and the second plot shows the boundary instances picked. The first dataset demonstrates the boundary for instances drawn from two different normal distributions. The second dataset consists of points generated randomly and labeled according to their distances from pre-chosen circle centers. The third dataset is a 4-by-4 checkerboard: the data instances were generated randomly and labeled on the basis of the square they occupy. The average number of instances selected is about 10% of the original dataset size in each case. These are toy datasets in 2D; results for higher-dimensional datasets are presented in the section on experimental validation (Section 4).

Figure 4. Gaussian Classes.
Figure 5. Circular +ve Class.
Figure 6. Checkerboard.
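For instance, the checkerboard data can be generated in a few lines (a hypothetical generator; the paper does not specify its exact one):

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(2000, 2))        # random points on a 4x4 board
# Label by the parity of the occupied square, so adjacent squares alternate.
y = np.where(X.astype(int).sum(axis=1) % 2 == 0, 1, -1)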
4. Experiments
We performed experiments on five datasets (details follow shortly) to evaluate the effectiveness of our approach. The objectives of our experiments were
• to evaluate the speedup obtained,
• to evaluate the effects of parameters k and G on
our algorithm, and
• to evaluate the quality of the classifier using our
algorithm.
All our experiments were performed using the Gaussian RBF kernel exp(−ψ‖x_i − x_j‖₂²). The parameters used for the datasets (chosen on the basis of experimental validation) are reported in Table 1. The training of SVMs was performed using SVMLight (Joachims, 1999). The value of k was set to 100 in all the experiments. We also present results for experiments with other values of k, ranging over 200, 300, 400, 500, and 600, for the Mnist dataset. The choice of k = 100 lowers both storage and processing costs while retaining reasonable accuracy levels; we show that a larger k may not be helpful in improving classification accuracy. We report average results over five runs. The experiments were performed on a Linux machine with a 1.5GHz processor and 1GB DRAM. The quality measures used in our experiments are accuracy (percentage of correct predictions) and traditional precision/recall.
Table 1. Dataset Details

Dataset    ψ        Training   Testing
Mnist      1.666    60000      10000
Letter     9.296    16000      4000
25K        0.1666   18729      6271
Corel      0.111    42386      8260
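To tie the pieces together: the paper trains with SVMLight; purely as an illustration, the same RBF setup can be sketched with a generic SVM library (scikit-learn here, our substitution), whose gamma parameter plays the role of ψ:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                  # stand-in for a selected subset
y = np.where(X[:, 0] + 0.1 * rng.normal(size=500) > 0, 1, -1)

clf = SVC(C=1.0, kernel="rbf", gamma=1.666)     # K = exp(-psi * ||xi - xj||^2)
clf.fit(X, y)
print("support vectors:", clf.n_support_.sum())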
4.1. Datasets

The Mnist dataset (LeCun et al., 1998) consists of images of handwritten digits. The training set contains 60,000 vectors, each with 576 features. The test set consists of another 60,000 images; however, instead of the entire test set, we used 10,000 instances for evaluation, as in (LeCun et al., 1998).

The letter-recognition dataset, available in the UCI repository, consists of feature descriptors for the 26 capital letters of the English alphabet. Of the 20,000 instances in the dataset, 16,000 were randomly chosen as the training set, with the rest forming the test set.

The 25K-image dataset contains images gathered from both web sources and the Corel image collection, categorized into over 400 categories. Of these, we present results for the 12 largest categories. 75% of the dataset was used as the training set, while the remaining instances served as the test set.
The Corel and Corbis (http://pro.corbis.com) datasets contain about 51K and 315K images, respectively. The categories in the two datasets number more than 500 and 1,100, respectively. Images in some related categories in the Corel dataset were grouped together. The details of the groupings and the top categories chosen for evaluation are presented in Table 2. Feature extraction yielded a 144-dimensional feature vector representing color and texture information for each image. 15% of the dataset was randomly chosen as the test set, with the rest retained as the training set.
Table 2. Corel Categories

0:  Asianarc           1:  Ancient Architecture   2:  Food
3:  Flora              4:  Architecture(I-X)      5:  Landscape
6:  Magichr            7:  Museums                8:  Old Buildings
9:  Objects(I-VIII)    10: Textures               11: Urban
12: Water              13: Coastal                14: Creatures(I-V)
4.2. Results

We report the results of our experiments on the Mnist dataset in Table 3 and Figures 7 to 10. Figure 7 presents the variation of the accuracy with different choices of percentages of G. After computing the total sum, G, of the scores of all instances, the subset was determined by choosing instances making up the top 30, 50, 70, 80, 90, and 99% of G. The 100% curve presents the performance using SVMs on the entire training set. Figures 8 and 9 present precision and recall figures for only the higher percentages (> 70%). Figures 10 and 11 present the variation of the speedup with the different percentages. These figures show that when G > 70%, the training data selected by our boundary-detection algorithm achieve about the same testing accuracy as using the entire dataset for training. When taking the concept-independent preprocessing (first-step) time into consideration, our algorithm achieves an average fivefold speedup; when we do not consider the preprocessing time, the speedup can exceed tenfold.

Table 3 details the results for G = 80% on the Mnist dataset. In the table, we compare our approach with bagging and IVM. The table presents qualitative comparisons of the accuracy, precision, and recall achieved by the proposed technique against SVMs on the entire dataset. Comparisons of the number of support vectors selected indicate that the classification time for unseen instances using our approach would always be smaller. The table also presents timing comparisons, where the time taken by SVMs on the entire dataset is compared with our approach. For our approach, we present both the time taken for SVM learning on the subset (T_L) and the time taken to obtain the subset (T_B). Speedup computations are performed after summing T_L, T_B, and the amortized list-construction time (≈ 93s). Experiments with bagging indicate that, to attain the same levels of accuracy, multiple bags need to be used, and the cumulative training time for these bags is comparable to the time taken by the SVM learning algorithm on the entire dataset. Our approach achieves much higher speedup than IVM. The accuracy levels in all cases are comparable to the accuracy levels of SVMs on the entire training dataset.
To understand the speedup values obtained, we examined the decay of the scores of instances in the case of Mnist digit 0. Figure 12 shows the variation of the sorted scores of instances, with the fraction of instances chosen on the x-axis and the score on the y-axis. We also present an exponential fit to the data. As can be seen from the figure, the sorted scores of the instances show a roughly exponential decay. Such a decay helps limit the number of instances picked, even when we select instances whose sum of scores is within a high percentage of the respective sum G.
Figure 12. Mnist Digit 0 dataset: sorted normalized scores (y-axis) versus fraction of instances chosen (x-axis), with an exponential fit.
Results with k values ranging from 200 to 600 are presented in Figure 13. The accuracy figures remain almost unaltered when compared to the figures for k = 100, while the speedup figures dip because of the extra processing needed.
Figure 7. Accuracy Results (Mnist); curves show subsets covering the top 30, 50, 70, 80, 90, and 99% of G, plus the 100% (full-dataset) baseline.

Figure 8. Precision (Mnist).

Figure 9. Recall (Mnist).

Figure 10. Speedup (Mnist).

Figure 11. Speedup without Stage 1 (Mnist).

Table 3. Results for the Mnist dataset (SVM on the whole dataset; subset within 80% of G). CBD – our approach; T_L – time taken by the learner on the subset; T_B – time taken for score evaluation; CBD* – speedup without the list-construction cost.

#   Accuracy        Precision       Recall          |SV|          T (sec)                    Speedup
    SVM     CBD     SVM     CBD     SVM     CBD     SVM    CBD    SVM       T_L      T_B     Bagging  IVM    CBD    CBD*
0   99.93   99.92   100.00  99.80   99.29   99.39   2215   896    1112.44   15.88    59.33   1.12     2.04   6.15   14.79
1   99.89   99.89   99.65   99.65   99.38   99.38   1772   775    833.93    15.09    56.87   1.05     2.02   5.14   11.58
2   99.79   99.78   99.51   99.61   98.35   98.26   3156   1897   1447.90   98.65    55.84   1.15     1.76   5.92   9.37
3   99.80   99.83   99.50   99.60   99.51   98.71   2915   2004   1372.72   111.12   56.85   1.14     1.68   5.32   8.17
4   99.80   99.78   99.49   99.28   98.47   98.47   2784   1552   1273.93   70.78    55.85   1.07     2.02   5.88   10.06
5   99.82   99.84   99.66   99.77   98.32   98.43   2960   1735   1369.26   93.19    57.69   1.15     1.57   5.68   9.07
6   99.82   99.80   99.68   99.26   98.43   98.64   2142   1093   1018.45   23.41    56.94   1.05     1.98   5.97   12.67
7   99.75   99.75   99.41   99.02   98.15   98.54   2801   1630   1264.60   75.23    83.08   1.13     2.03   5.09   7.98
8   99.74   99.68   99.79   99.89   97.54   96.82   3217   2225   1438.34   159.97   56.39   1.19     1.93   4.69   6.64
9   99.55   99.56   99.39   99.39   96.13   96.23   2829   2175   1276.75   121.54   83.12   1.21     1.42   4.33   6.23

Figure 13. Accuracy and Speedup Variation with k (Mnist).

Similar experiments were also performed with the letter-recognition, the 25K-image, and the image datasets. Because of space limitations, we present only results with the highest percentage (99% of G) of selected instances for these datasets. Figure 14 presents the variation of the accuracy and speedup over the 26 different categories in the letter-recognition dataset (we include concept-independent preprocessing time in all cases). Figures 15, 17, and 16 present the results over the different categories of the image datasets. Average speedup values for these datasets were about 3.5, 3, and 10 times, respectively. Accuracy results for these datasets were close to those obtained by SVMs on the entire dataset.

Figure 14. Accuracy & Speedup (Letter-recognition).

Our experiments indicate that, over all the categories in each of the datasets, our proposed approach attains speedup over the SVM training algorithm while retaining reasonably high accuracy levels. Experiments over diverse choices of k and percentages of G indicate that the proposed technique is not very sensitive to the choices of these parameters, and even a small neighborhood list of 100 instances suffices.
Figure 15. Accuracy & Speedup (25K).

Figure 16. Accuracy & Speedup (Corel).

Figure 17. Accuracy & Speedup (Corbis).
5. Conclusion

We have described an efficient strategy to obtain instances close to the boundary between classes. Our experiments over more than 100 different concepts indicate the applicability of the method in speeding up the training phase of support vector machines. An interesting aspect of the proposed approach is the smaller number of support vectors defining the classifier; having fewer support vectors bodes well for faster classification of test instances. As future work, we would like to explore the performance of the approach with other learning algorithms having robust loss functions like SVMs. In particular, we would like to develop a robust algorithm which generalizes efficiently to the multi-class case.
References

Achlioptas, D., McSherry, F., & Schölkopf, B. (2002). Sampling techniques for kernel methods. Advances in Neural Information Processing Systems.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.

Burges, C. (1999). Geometry and invariance in kernel based methods. In Advances in Kernel Methods: Support Vector Learning. MIT Press.

Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243–264.

Gionis, A., Indyk, P., & Motwani, R. (1999). Similarity search in high dimensions via hashing. Proceedings of VLDB (pp. 518–529).

Graf, H. P., Cosatto, E., Bottou, L., Dourdanovic, I., & Vapnik, V. (2005). Parallel support vector machines: The cascade SVM. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in Neural Information Processing Systems 17, 521–528. MIT Press.

Joachims, T. (1998). Making large-scale SVM learning practical. Advances in Kernel Methods – Support Vector Learning.

Joachims, T. (1999). Making large-scale SVM learning practical. Advances in Kernel Methods – Support Vector Learning. MIT Press.

Lawrence, N. D., Seeger, M., & Herbrich, R. (2003). Fast sparse Gaussian process methods: The informative vector machine. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in Neural Information Processing Systems. MIT Press.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.

Liu, T., Moore, A. W., Gray, A., & Yang, K. (2005). An investigation of practical approximate nearest neighbor algorithms. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in Neural Information Processing Systems 17, 825–832. Cambridge, MA: MIT Press.

Osuna, E., Freund, R., & Girosi, F. (1997). An improved training algorithm for support vector machines. IEEE Workshop on Neural Networks for Signal Processing.

Pavlov, D., Chudova, D., & Smyth, P. (2000). Towards scalable support vector machines using squashing. ACM SIGKDD (pp. 295–299).

Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines (Technical Report). Microsoft Research.

Smola, A. J., & Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. ICML.

Tresp, V. (2001). Scaling kernel-based systems to large data sets. Data Mining and Knowledge Discovery, 5, 197–211.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.

Yu, H., Yang, J., & Han, J. (2003). Classifying large data sets using SVMs with hierarchical clusters. Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining.