Modified Support Vector Machine for Detecting Bikini Images

李宏毅 R97942033, 李倫銓 D97921013
tlkagkb93901106@yahoo.com.tw, d97921013@ntu.edu.tw

Abstract

This paper uses two novel modified support vector machines (SVMs), the cluster-based SVM and the multi-kernel SVM tree, for more robust training. In our experiments, both methods outperform conventional SVM training.

1. Introduction

Recently, detecting pornographic or bikini images has become an important and useful problem in web mining and image retrieval, especially now that many people download pictures from web albums and blogs. In our term project, we try to detect pornographic or bikini images in a collection of normal images and to overcome some difficulties of the task. (For this project we wrote a spider to crawl images from Wretch albums. We use these pictures for this research project only and will delete them afterwards.)

2. Difficulty

We use the support vector machine (SVM) for our task. Although the SVM is a powerful machine learning method for classification, some difficulties appear when it is used as our training method. Consider the example images in Fig. 1. If we want to retrieve images of bikinis, all the images in Fig. 1 should be retrieved even though they look different. In feature space, however, the images in Fig. 1 may not lie close together (Fig. 2). If we select hyper-plane 1 to separate bikini images from other images, we miss many bikini images. If we instead select hyper-plane 2 as the separating hyper-plane, we retrieve many images irrelevant to bikinis.

Kernel functions may help to solve this problem, but they sometimes result in over-fitting. In our experiment, with a kernel function that maps to a high-dimensional feature space (e.g., the radial basis kernel function), the SVM produced a model with zero missing rate and zero false alarm rate on the training data that performed terribly on the testing data.

Two methods are proposed to make the SVM more robust.
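
The trade-off between the two hyper-planes described in this section can be illustrated with a small sketch. The one-dimensional feature values below are hypothetical and are not taken from the project data; a "hyper-plane" in one dimension is simply a threshold, and the two thresholds are chosen only to mimic hyper-plane 1 and hyper-plane 2 of Fig. 2.

```python
# Hypothetical 1-D feature values (not from the project data).
bikini = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]   # bikini images forming two separate groups
normal = [2.5, 3.0, 3.5, 4.0, 6.5, 7.0]   # normal images, some lying between the groups


def count_errors(threshold):
    """Classify x < threshold as bikini; return (missed bikinis, false alarms)."""
    missed = sum(1 for x in bikini if x >= threshold)
    false_alarms = sum(1 for x in normal if x < threshold)
    return missed, false_alarms


# Threshold 2.0 plays the role of hyper-plane 1, threshold 6.0 of hyper-plane 2.
for name, t in [("hyper-plane 1", 2.0), ("hyper-plane 2", 6.0)]:
    missed, false_alarms = count_errors(t)
    print(f"{name}: missed {missed}/{len(bikini)}, "
          f"false alarms {false_alarms}/{len(normal)}")
```

With these toy values, the tight threshold misses the whole second group of bikini images, while the loose threshold catches every bikini image at the cost of flagging the normal images that lie in between.
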
We implement the two methods on our task, pornographic image detection.

Fig. 1

Fig. 2

2. Cluster-based SVM

Because pornographic images may differ greatly from one another, we cluster the pornographic images in the training data by k-means (Fig. 3). Each cluster is then used to train an SVM model (Fig. 4). Eight clusters are used in our experiment.

Fig. 3

Fig. 4

3. Multi-kernel SVM tree

The following graph illustrates the training procedure of an SVM tree using the data of Fig. 2:

The procedure above is a simple SVM tree, which may outperform either hyper-plane in Fig. 2. An SVM tree is a kind of decision tree in which the questions are hyper-planes. It can be viewed as a decision tree that considers several questions at the same time. In our experiment, however, the SVM tree did not significantly outperform the conventional SVM: the SVM is so powerful that, after separating the features into two nodes, it cannot separate the nodes any further, even though many misclassifications remain in the nodes.

We therefore modify the SVM tree into a multi-kernel SVM tree. To split a node, we choose from a predefined set of kernel functions a kernel that decreases the number of misclassifications on the development set.

MULTI-KERNEL SVM TREE ALGORITHM
    SPLIT-NODE(all training data, all development data)
END of ALGORITHM

SPLIT-NODE(training set, development set)
{
    take an unused kernel function f from the kernel function set
    use function f to train an SVM model m on the training set
    use model m to classify the development set
    if (classifying the development set with model m decreases the number of misclassifications)
        use model m to split the training set into training set A and training set B
        use model m to split the development set into development set A and development set B
        SPLIT-NODE(training set A, development set A)
        SPLIT-NODE(training set B, development set B)
        break
    end
} until (every kernel function in the kernel function set has been used)

4. Experiment

4.1 Data

6879 images are gathered by the spider.
712 images are manually labeled as pornographic, and the other 6167 images are considered normal. 4-fold cross-validation is conducted in all the experiments below.

4.2 Image feature

We use color histograms and Gabor texture in HSV color space as our features. There are 186 features for each image: 162 color histogram features and 24 Gabor texture features.

4.3 Evaluation Merit

Our evaluation merit differs from that of retrieval because our goal is to detect pornographic images and then reject them rather than retrieve them. The system is evaluated by its missing rate and its false alarm rate. We use "loss" to define the performance of our system; with our setting, Loss = Missing rate + False alarm rate, as the values in Table 1 show.

4.4 Experiment Result

                    Single SVM        Single SVM       Cluster-based   Multi-kernel
                    (linear kernel)   (best kernel*)   SVM             SVM tree
Missing rate        15.87             32.86            40.17           20.37
False alarm rate    40.44             17.64            9.50            23.17
Loss                56.31             50.5             49.67           43.54

*Average performance of the kernel function that performs best in each fold

Table 1

4.6 Discussion

From Table 1, we observe that the multi-kernel SVM tree outperforms even selecting, for each fold, the kernel that performs best on that fold's testing set. This observation supports the usefulness of the multi-kernel SVM tree.

5. Conclusion and Future Work

The multi-kernel SVM tree shares some of the spirit of AdaBoost. AdaBoost classifies a feature by combining several classifiers with different weights. The multi-kernel SVM tree, however, separates the features into several regions and then determines the class of a feature with a model trained on local data. The multi-kernel SVM tree can be developed to combine several classifiers, and, unlike AdaBoost, it is able to focus on local information. Combining several machine learning algorithms instead of only SVM may be interesting.
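
As an illustration of the SPLIT-NODE procedure in Section 3, the following is a minimal sketch using NumPy and scikit-learn, not the implementation used in the project. Several details are assumptions: the kernel set (`linear`, `poly`, `rbf`), the depth cap `max_depth`, the reading of "decrease the number of misclassifications" as "fewer development-set errors than the label the node currently assigns", and the synthetic two-dimensional data that stands in for the 186 image features. The names `split_node` and `predict_one` are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed kernel set: the report does not list the kernels it used.
KERNELS = ["linear", "poly", "rbf"]


def split_node(X_tr, y_tr, X_dev, y_dev, label, depth=0, max_depth=3):
    """Recursive SPLIT-NODE returning the tree as a nested dict.

    `label` is the class (0 = normal, 1 = pornographic) that this node
    predicts if it is not split any further.
    """
    leaf = {"label": label}
    # An SVM needs both classes to train; the depth cap is an added safeguard.
    if depth >= max_depth or len(np.unique(y_tr)) < 2 or len(y_dev) == 0:
        return leaf
    errors_now = int(np.sum(y_dev != label))
    for kernel in KERNELS:
        model = SVC(kernel=kernel, gamma="scale").fit(X_tr, y_tr)
        dev_pred = model.predict(X_dev)
        # Accept the first kernel that reduces development-set errors.
        if int(np.sum(dev_pred != y_dev)) < errors_now:
            tr_pred = model.predict(X_tr)
            children = {}
            for side in (0, 1):
                tr_mask = tr_pred == side
                dev_mask = dev_pred == side
                children[side] = split_node(
                    X_tr[tr_mask], y_tr[tr_mask],
                    X_dev[dev_mask], y_dev[dev_mask],
                    label=side, depth=depth + 1, max_depth=max_depth,
                )
            return {"model": model, "children": children}  # the "break"
    return leaf  # no kernel in the set helped


def predict_one(tree, x):
    """Follow the SVM decisions from the root down to a leaf label."""
    node = tree
    while "model" in node:
        side = int(node["model"].predict(x.reshape(1, -1))[0])
        node = node["children"][side]
    return node["label"]


# Synthetic stand-in data: an XOR-like problem that a single hyper-plane
# cannot separate.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
X_tr, y_tr = X[:150], y[:150]
X_dev, y_dev = X[150:225], y[150:225]
X_te, y_te = X[225:], y[225:]

root_label = int(np.bincount(y_tr).argmax())  # majority class at the root
tree = split_node(X_tr, y_tr, X_dev, y_dev, label=root_label)
pred = np.array([predict_one(tree, x) for x in X_te])
print("test accuracy:", float(np.mean(pred == y_te)))
```

Because a split is accepted only when it lowers the error count on the development set, the tree stops growing at nodes where none of the kernels helps, which mirrors the until-condition of the pseudocode.
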