report - Web Mining and Information Retrieval Laboratory

Modified Support Vector Machine for Detecting
Bikini Images
李宏毅 R97942033, 李倫銓 D97921013
tlkagkb93901106@yahoo.com.tw, d97921013@ntu.edu.tw
Abstract
This paper uses two novel modified support vector machines (SVMs), the cluster-based
SVM and the multi-kernel SVM tree, for more robust training. In our experiments, both
methods outperform traditional SVM training.
1. Introduction
Recently, detecting pornographic or bikini images has become an important and
useful problem in web mining and image retrieval, especially now that many people
download pictures from web albums or blogs. In our term project, we try to detect
pornographic or bikini images in a collection of normal images and to overcome
some difficulties of the task. (For this project we wrote a spider to scrape images from
Wretch albums. We use these pictures for this project's research only and will delete
them afterwards.)
2. Difficulty
The support vector machine (SVM) is the method we use in our task. Although the
SVM is a powerful machine learning method for classification, some
difficulties appear when it is used as our training method.
Consider the image examples in Fig. 1. If we want to retrieve images of
bikinis, all the images in Fig. 1 should be retrieved although they look different.
However, in feature space, the images in Fig. 1 may not cluster together (Fig. 2). If we
select hyper-plane 1 to separate bikinis from other images, we miss many
bikini images. If we instead select hyper-plane 2 as the separating hyper-plane,
we get many images irrelevant to bikinis. Kernel functions may be a useful
tool for this problem, but they sometimes result in
over-fitting. In our experiment, with a kernel function whose feature space is
high-dimensional (e.g. the radial basis function kernel), the SVM produced a model with
zero missing rate and zero false alarm rate on the training data that performed terribly
on the testing data. We propose two methods to make the SVM more robust and
implement both on our task of pornographic image detection.
Fig. 1
Fig. 2
2. Cluster-based SVM
Because pornographic images may differ greatly from one another, we cluster the
pornographic images in the training data by k-means (Fig. 3). Each cluster is then used
to train an SVM model (Fig. 4). Eight clusters are used in our experiment.
Fig. 3
Fig.4
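A minimal sketch of the cluster-based idea follows. Several details are assumptions, because the text does not specify them: each cluster's SVM is trained against all normal images, an image is flagged when any cluster's SVM fires, and a small linear SVM trained by subgradient descent on the hinge loss stands in for whatever SVM implementation was actually used. All function names and the toy 2-D data are hypothetical.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns a cluster index for every point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        assign = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        # Move each center to the mean of its members (unchanged if empty).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Linear soft-margin SVM: batch subgradient descent on the regularized
    hinge loss. Labels y are -1 or +1. Returns (weights, bias)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [lam * wi for wi in w], 0.0
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) < 1:  # margin violated
                gw = [g - yi * xj / n for g, xj in zip(gw, xi)]
                gb -= yi / n
        w = [wj - lr * g for wj, g in zip(w, gw)]
        b -= lr * gb
    return w, b

def train_cluster_svms(positives, negatives, k):
    """One SVM per positive cluster, each trained against all negatives (assumed)."""
    assign = kmeans(positives, k)
    models = []
    for c in range(k):
        members = [p for p, a in zip(positives, assign) if a == c]
        if members:
            X = members + negatives
            y = [1] * len(members) + [-1] * len(negatives)
            models.append(train_linear_svm(X, y))
    return models

def predict(models, x):
    """Assumed combination rule: positive if any cluster's SVM says positive."""
    return any(sum(wj * xj for wj, xj in zip(w, x)) + b > 0 for w, b in models)

# Toy data in the spirit of Fig. 2: positives lie on both sides of the negatives,
# so no single hyper-plane separates them.
positives = [[-4.5, 0.2], [-4.0, -0.3], [-3.5, 0.1], [3.5, -0.2], [4.0, 0.3], [4.5, 0.0]]
negatives = [[-0.5, 0.3], [0.0, -0.2], [0.5, 0.1]]
models = train_cluster_svms(positives, negatives, k=2)
```

With the positives split into two groups, each group is linearly separable from the negatives even though the positives as a whole are not, which is the motivation for clustering before training.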
3. Multi-kernel SVM tree
The following graph illustrates the training procedure of an SVM tree using the data
of Fig. 2:
The procedure above yields a simple SVM tree that may outperform any single
hyper-plane in Fig. 2. An SVM tree is a kind of decision tree in which the questions
asked at the nodes are hyper-planes. It can be viewed as a decision tree that considers
several questions at the same time.
In our experiment, however, the SVM tree did not outperform the
conventional SVM significantly. The SVM is so powerful that, after separating the
features into two nodes, it cannot separate the nodes any further, even though many
misclassifications remain in the nodes. We therefore modify the SVM tree into the
multi-kernel SVM tree: to split a node, we choose from a predefined set of kernel
functions one that decreases the number of misclassifications on the development set.
MULTI-KERNEL SVM TREE ALGORITHM
    SPLIT-NODE(all training data, all development data)
END of ALGORITHM

SPLIT-NODE(training set, development set)
repeat
{
    take an unused kernel function f from the kernel function set
    use f to train an SVM model m on the training set
    use m to classify the development set
    if (classifying the development set with m decreases the number of
        misclassifications)
        use m to split the training set into training set A and training
            set B
        use m to split the development set into development set A and
            development set B
        SPLIT-NODE(training set A, development set A)
        SPLIT-NODE(training set B, development set B)
        break
    end
} until (every kernel function in the kernel function set has been used)
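The control flow of SPLIT-NODE can be sketched in Python as below. This is an illustration under stated assumptions, not the implementation used in the experiment: the "decrease" test compares against labelling the whole node with its majority class, a guard against one-sided splits is added, and the trainer is a nearest-class-mean stand-in rather than a real SVM so that the sketch stays self-contained. In the sketch a "kernel" is represented by an explicit feature function; all names are hypothetical.

```python
def majority(labels):
    """Most common label among -1/+1 labels (ties go to +1)."""
    return 1 if sum(labels) >= 0 else -1

def split_node(train_set, dev_set, kernels, train):
    """Build a tree node from lists of (x, y) pairs with y in {-1, +1}."""
    labels = [y for _, y in train_set]
    node = {"label": majority(labels), "model": None}
    if len(set(labels)) < 2 or not dev_set:
        return node  # pure node, or nothing to validate a split on
    # Assumption: the baseline is the error of the node's majority label.
    baseline = sum(1 for _, y in dev_set if y != node["label"])
    for kernel in kernels:
        model = train(kernel, train_set)
        if sum(1 for x, y in dev_set if model(x) != y) >= baseline:
            continue  # no improvement on the development set; try the next kernel
        train_a = [(x, y) for x, y in train_set if model(x) == 1]
        train_b = [(x, y) for x, y in train_set if model(x) != 1]
        if not train_a or not train_b:
            continue  # one-sided split (added guard, not in the pseudocode)
        dev_a = [(x, y) for x, y in dev_set if model(x) == 1]
        dev_b = [(x, y) for x, y in dev_set if model(x) != 1]
        node["model"] = model
        node["pos"] = split_node(train_a, dev_a, kernels, train)
        node["neg"] = split_node(train_b, dev_b, kernels, train)
        break
    return node

def classify(node, x):
    """Follow the models down to a leaf and return the leaf's label."""
    while node["model"] is not None:
        node = node["pos"] if node["model"](x) == 1 else node["neg"]
    return node["label"]

def train_stand_in(feature, data):
    """Nearest-class-mean rule on feature(x); a stand-in for SVM training."""
    pos = [feature(x) for x, y in data if y == 1]
    neg = [feature(x) for x, y in data if y == -1]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    mid = (mean_pos + mean_neg) / 2
    sign = 1 if mean_pos >= mean_neg else -1
    return lambda x: sign if feature(x) >= mid else -sign

# Toy 1-D data: positives on both sides of the negatives.
kernels = [lambda x: x, lambda x: x * x]  # "linear" and "quadratic" feature maps
train_data = ([(x, 1) for x in (-4.5, -4.0, -3.5, 3.5, 4.0, 4.5)]
              + [(x, -1) for x in (-0.5, 0.0, 0.5)])
dev_data = [(-4.2, 1), (4.2, 1), (-0.3, -1), (0.3, -1)]
tree = split_node(train_data, dev_data, kernels, train_stand_in)
```

On this toy data the first feature map does not reduce the development errors, so the node is split with the second one. A real SVM trainer with the same call shape could be passed as `train` instead of the stand-in.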
4. Experiment
4.1 Data
A total of 6879 images were gathered by the spider. Of these, 712 were manually
labeled as pornographic, and the other 6167 are considered normal. 4-fold
cross-validation is used in all the experiments below.
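One way to produce the four folds is sketched below. The shuffling, the fixed seed and the function name are assumptions; the text does not say how the images were assigned to folds or whether the split was stratified by class.

```python
import random

def k_fold_indices(n_items, k=4, seed=0):
    """Yield (train_idx, test_idx) pairs; every item is tested exactly once."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(6879, k=4))
fold_sizes = [len(test_idx) for _, test_idx in splits]
```

Because only 712 of the 6879 images are positive, a stratified split that keeps the class ratio equal in every fold would be a reasonable alternative.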
4.2 Image feature
We use a color histogram and Gabor texture features in HSV color space as our
features. There are 186 features for each image: 162 are color histogram features and
24 are Gabor texture features.
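The color histogram part can be sketched as follows. The quantization into 18 hue, 3 saturation and 3 value bins is an assumption chosen only because it gives 162 bins; the text does not state how the bins were formed. The function name and the pixel format are hypothetical, and the Gabor texture features are not covered here.

```python
import colorsys

H_BINS, S_BINS, V_BINS = 18, 3, 3  # assumed quantization: 18 * 3 * 3 = 162 bins

def hsv_histogram(pixels):
    """pixels: iterable of (r, g, b) tuples with components in 0..255.
    Returns a list of 162 bin frequencies that sum to 1."""
    hist = [0.0] * (H_BINS * S_BINS * V_BINS)
    count = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # min(...) keeps a component equal to 1.0 inside the last bin.
        hi = min(int(h * H_BINS), H_BINS - 1)
        si = min(int(s * S_BINS), S_BINS - 1)
        vi = min(int(v * V_BINS), V_BINS - 1)
        hist[(hi * S_BINS + si) * V_BINS + vi] += 1.0
        count += 1
    return [c / count for c in hist] if count else hist

# Two red pixels, one white, one black.
example = hsv_histogram([(255, 0, 0), (255, 0, 0), (255, 255, 255), (0, 0, 0)])
```

Normalizing by the pixel count makes histograms of images with different sizes comparable.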
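4.3 Evaluation Metric
The evaluation metric differs from the one used in retrieval because our goal is
to detect pornographic images and reject them rather than retrieve them. Some
terms must be defined first: the missing rate (pornographic images that the system
fails to detect) and the false alarm rate (normal images that the system rejects).
We use “loss” to define the performance of our system. With the setting used here,
the loss is the sum of the missing rate and the false alarm rate, as every column of
Table 1 shows.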
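The metric can be sketched as below. The definitions are assumptions, since the formulas are missing from this copy of the text: the missing rate is taken as the percentage of pornographic images classified as normal, the false alarm rate as the percentage of normal images classified as pornographic, and the loss as their weighted sum with both weights equal to 1, which agrees with the entries of Table 1. The function and parameter names are hypothetical.

```python
def detection_rates(labels, predictions):
    """labels, predictions: sequences of booleans, True = pornographic.
    Assumes both classes occur in labels. Returns rates in percent."""
    positives = sum(1 for t in labels if t)
    negatives = len(labels) - positives
    missed = sum(1 for t, p in zip(labels, predictions) if t and not p)
    false_alarms = sum(1 for t, p in zip(labels, predictions) if not t and p)
    return 100.0 * missed / positives, 100.0 * false_alarms / negatives

def loss(missing_rate, false_alarm_rate, w_miss=1.0, w_false=1.0):
    """Weighted sum of the two error rates; equal weights are assumed."""
    return w_miss * missing_rate + w_false * false_alarm_rate

# Toy example: 4 pornographic and 6 normal images.
labels = [True] * 4 + [False] * 6
predictions = [True, True, True, False] + [True, True, True, False, False, False]
missing_rate, false_alarm_rate = detection_rates(labels, predictions)
toy_loss = loss(missing_rate, false_alarm_rate)

# Consistency check against the linear-kernel column of Table 1.
table_check = round(loss(15.87, 40.44), 2)
```

Summing two rates that are each normalized by their own class size keeps the large normal class (6167 images) from dominating the small pornographic class (712 images).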
4.4 Experiment Result
                        Single SVM                     Cluster-based   Multi-kernel
                        Linear kernel   Best kernel*   SVM             SVM tree
Missing rate (%)        15.87           32.86          40.17           20.37
False alarm rate (%)    40.44           17.64           9.50           23.17
Loss                    56.31           50.50          49.67           43.54
*Average, over the folds, of the performance of the kernel function that performs best in each fold
Table 1
4.5 Discussion
From Table 1, we can observe that the multi-kernel SVM tree outperforms even
the strategy of selecting, for each fold, the kernel that performs best on that fold's
testing set. This observation supports the usefulness of the multi-kernel SVM tree.
5. Conclusion and Future Work
The multi-kernel SVM tree shares some of the spirit of AdaBoost. AdaBoost classifies
a feature vector by combining several classifiers with different weights. The
multi-kernel SVM tree, in contrast, separates the features into several regions and then
determines the class of a feature vector with a model trained on local data. The
multi-kernel SVM tree can be extended to combine several classifiers and, unlike
AdaBoost, it is able to focus on local information. Combining several machine learning
algorithms instead of only SVMs may be interesting.