Rajamangala University of Technology Tawan-ok International Conference, Thailand, 29-31 May, 2013
A NOVEL METHOD FOR PORNOGRAPHY PICTURE DETECTION IN LARGE
SCALED SYSTEM
Hoang Trung Nguyen1, Minh Quang Nguyen2
1 R/D Department, Naiscorp joint stock company, Hanoi, Vietnam. *E-mail: trungnh@socbay.com
2 Department of Software Engineering, Faculty of Information and Technology, Hanoi
National University of Education, Hanoi, Vietnam.
Abstract: This paper presents a method for detecting pornography or nudity in color images
for large-scale upload systems. Based on the observation that nude pictures are often
uploaded in groups, we apply an algorithm that computes a nudity probability for each picture
and then decides whether the picture group is pornographic. To calculate the probability of
each picture, we use three color spaces to increase the precision of detection: RGB, HSV and
IHLS. The skin regions are analyzed for clues indicating nudity or non-nudity, such as their
sizes and relative distances from each other. The proposed pornography picture detection
method for large-scale picture upload systems detects nudity with a 95.23% recall and a 0.04%
false positive rate. It correctly identified 5098 out of 5353 nude images.
Keywords: nudity detection, uploaded pictures, color space, skin region, pornography
detection.
Introduction: The purpose of our method is to detect pornographic images uploaded to our
large storage system. With 200,000 users at present, this system holds more than 50 million
pictures, and 100,000 pictures are added every day.
Pornographic images contain a large proportion of skin, and therefore pattern
recognition of skin color is the primary direction of much research. Different mechanisms are
used to detect skin by color content. Most research efforts detect nudity features from the
content of individual pictures. Lee and Yoo [7] used an
elliptical boundary model to separate skin chrominance from non-skin chrominance. The
nudity detection algorithm in [18] is based on three color spaces to determine the skin regions
in a picture. From the percentage of the three largest regions, the picture is classified as
nudity or non-nudity. Basically, the decision is made from the probability, i.e., a picture with
large areas of skin will be classified as nudity.
Almost all these approaches are similar in that they detect nudity characteristics of a
particular image without information about other items uploaded at the same time.
In this paper, we propose a new method that classifies a picture as nudity or not based on the
group or album it belongs to and the probability of each item in that album. The method
emerged from an observation that nude pictures are often uploaded in groups. We use
thresholds to decide whether a picture group is nudity: the group is flagged if the number of
pictures with a high nudity probability is larger than a threshold.
Our work came from the common behavior of users in an upload system, where
pornographic pictures often go together in groups or albums. Based on the images collected over
6 months, there is a trend that users often upload nude pictures together in groups of 10 to
50 images. Consequently, we apply a mechanism to detect nudity at the group level.
In contrast to other detection systems, where each image is classified individually, we use an
algorithm that assigns a probability to each image. The conclusion whether the group is nudity
or not is then drawn from the probabilities of the images in the group.
To determine the probability for a single image, we use a combination of three color spaces:
RGB, HSV and normalized RGB color space, and apply an algorithm for Nudity Detection
[16]. This algorithm supports three picture formats: JPEG, BMP and PNG. It can,
however, easily be extended to classify images in other formats.
Methodology:
Skin detection
The first basic step is to determine the presence of objects that are likely to be
pornographic. This requires an analysis of image contents, including color, filename,
dimensions, texture, and the shapes of the objects present [1], [6], [17]. Since there is a
strong connection between the proportion of skin area and nudity, we use the color
information and extract some clues from it.
Basically, an image is represented as a matrix of pixels, where each pixel is a point
with a specific color. There are several ways to express colors, depending on the choice of color
space. So far, there is no clear evidence showing which color space is optimal for a detection
system [8]. To improve the precision of skin detection, we apply three color spaces as follows:
- The RGB Color Space: The red-green-blue (RGB) color space is a traditional color model
for computer graphics, originally developed for cathode ray tubes (CRTs). To specify a
color, the RGB color space uses a combination of three values, corresponding to three
components: red, green and blue. Each component ranges from 0 to 255.
Since brightness and color are coupled in RGB, it is not well suited to color segmentation
in images taken under uncontrolled lighting conditions [1].
- HLS and HSV: The HLS (hue, lightness and saturation) and HSV (hue, saturation and value)
color spaces are commonly used in image processing. The base color (hue) is chosen as an
angle in a circular disc. The saturation is given by the distance from the origin: the
saturation increases with the distance from the origin. This can be represented as a
two-dimensional circular map, but when introducing the lightness, the model needs a third
dimension. This is done by adding an axis perpendicular to the hue and saturation plane,
i.e. a Z-axis. The model is called a 3D-polar coordinate color space.
The components are given by the equations below.
Some studies show that HSV is invariant to highlights under white light sources, to matte
surfaces, and to ambient lighting.
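As a rough illustration of the conversion from RGB to a hue-based space, the following minimal sketch uses Python's standard `colorsys` module. The helper name `rgb255_to_hsv` and the choice to report hue in degrees are assumptions made for this example, and the formulas `colorsys` applies may differ in detail from those the authors intended.

```python
import colorsys


def rgb255_to_hsv(r, g, b):
    """Convert 8-bit RGB components (0..255) to (hue in degrees, saturation, value).

    Hypothetical helper: colorsys works on floats in [0, 1] and returns hue
    as a fraction of a full turn, so the inputs and the hue are rescaled here.
    """
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s, v


if __name__ == "__main__":
    # A few sample pixels: pure red, pure green and a mid gray.
    for pixel in [(255, 0, 0), (0, 255, 0), (128, 128, 128)]:
        print(pixel, rgb255_to_hsv(*pixel))
```

Hue is an angle and saturation a radial distance, which matches the circular-disc picture described above; the value axis plays the role of the third dimension.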
- The IHLS color space: IHLS is a modified color space presented in [19], where the I before
HLS stands for improved. The idea is to use the properties of the HLS color space, but with a
significant difference. In HLS and HSV, the saturation depends on the luminance. In IHLS, the
saturation is defined in such a way that it does not depend on the value of the luminance
[19, pp. 2-8].
Saturation definitions for HLS, HSV and IHLS are shown below:
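For orientation, the sketch below writes out the commonly cited saturation definitions for the three spaces, with R, G and B scaled to [0, 1]. These are the textbook forms (HSV: (max − min)/max; HLS: (max − min) divided by (max + min) or by (2 − max − min) depending on lightness; IHLS: max − min) and are given here as an assumption about what the paper's formulas express; the function names are hypothetical.

```python
def saturation_hsv(r, g, b):
    """HSV saturation: chroma divided by the maximum component."""
    mx, mn = max(r, g, b), min(r, g, b)
    return 0.0 if mx == 0 else (mx - mn) / mx


def saturation_hls(r, g, b):
    """HLS saturation: chroma normalized by a lightness-dependent term."""
    mx, mn = max(r, g, b), min(r, g, b)
    if mx == mn:
        return 0.0  # achromatic pixel, saturation is zero
    lightness = (mx + mn) / 2.0
    denom = (mx + mn) if lightness <= 0.5 else (2.0 - mx - mn)
    return (mx - mn) / denom


def saturation_ihls(r, g, b):
    """IHLS saturation: plain chroma, with no normalization by brightness."""
    return max(r, g, b) - min(r, g, b)


if __name__ == "__main__":
    dark_pixel = (0.2, 0.1, 0.1)  # a dark, slightly reddish pixel
    print(saturation_hsv(*dark_pixel),
          saturation_hls(*dark_pixel),
          saturation_ihls(*dark_pixel))
```

For dark pixels the HSV and HLS values are inflated by the division, while the IHLS value stays small, which is the independence from luminance mentioned above.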
- Skin detection algorithm: Human skin color, which depends mainly on hemoglobin and
melanin concentrations and on lighting conditions [16], is not located randomly within a color
space. Skin colors are grouped within small areas of the color space, which vary depending on
the color space used [15].
Many studies have sought to develop methods for the classification of skin-color
pixels using existing color spaces. One proposed method [18] for the detection of skin
color in RGB color space uses simple rules to rapidly construct skin-color pixel classifiers
[12]. This method uses a basic arithmetical formula to describe the relationship between
the three components of RGB color:
- R ≥ 95
- G ≥ 40
- B > 20
- max{R, G, B} - min{R, G, B} > 15
- |R - G| > 15
- R > G
- R > B
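The RGB rules above translate directly into a per-pixel test. The sketch below is a minimal illustration, assuming 8-bit components in the range 0 to 255 and that all seven conditions must hold at once; the function name `is_skin_rgb` is hypothetical.

```python
def is_skin_rgb(r, g, b):
    """Return True if an 8-bit RGB pixel satisfies all seven RGB skin rules."""
    return (
        r >= 95
        and g >= 40
        and b > 20
        and max(r, g, b) - min(r, g, b) > 15
        and abs(r - g) > 15
        and r > g
        and r > b
    )


if __name__ == "__main__":
    samples = {
        "light skin tone": (200, 140, 120),
        "mid gray": (100, 100, 100),
        "sky blue": (30, 120, 200),
    }
    for name, pixel in samples.items():
        print(name, pixel, is_skin_rgb(*pixel))
```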
Other color spaces have also been used to define skin-color pixel classifiers. Areas of
color space belonging to the skin-color set have been defined using the HLS, HSV and
IHLS color spaces [18]:
For the IHLS color space, we built a static filter from the skin distribution in Weka [4]
and refined the corresponding values on test images. Finally, the following rules are
adopted [19]:
iHmax = 50, iHmin = 0, iSmax = 0.9, iSmin = 0.1
where iHmax and iHmin are the upper and lower boundary values for the hue
component, iSmax and iSmin are the upper and lower boundary values for the saturation
component of the IHLS color space.
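A range filter of this kind can be sketched as follows. The sketch assumes that the hue bounds are expressed in degrees and the saturation bounds on a 0 to 1 scale, and that the bounds are inclusive; none of these details is stated explicitly in the text, and the function name is hypothetical.

```python
# Boundary values quoted in the text.
IH_MIN, IH_MAX = 0.0, 50.0   # hue bounds (assumed to be degrees)
IS_MIN, IS_MAX = 0.1, 0.9    # saturation bounds (assumed scale 0..1)


def in_ihls_skin_range(hue, saturation):
    """Return True if an IHLS (hue, saturation) pair lies inside the static filter."""
    return IH_MIN <= hue <= IH_MAX and IS_MIN <= saturation <= IS_MAX


if __name__ == "__main__":
    print(in_ihls_skin_range(20.0, 0.35))   # inside both ranges
    print(in_ihls_skin_range(200.0, 0.35))  # hue outside the range
    print(in_ihls_skin_range(20.0, 0.05))   # saturation too low
```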
Nudity probability calculation
To calculate the nudity probability, the algorithm relies primarily on color
content. More specifically, a picture with a larger area of skin should be assigned a higher
probability. In addition, we observe that pornographic pictures have many tones of
skin, and that skin regions in a particular picture are often close to each other. The algorithm
uses these features as the main clues to determine the nudity probability of a picture.
The algorithm we used in our system is modified from [19]. It consists of recognizing
skin pixels in the image, locating skin regions, analyzing these regions to obtain several
clues, and then assigning a probability to each picture.
In the first step, to classify a pixel as skin or non-skin, the algorithm requires a skin
distribution model. It scans each picture from the upper left corner to the lower right corner,
retrieving the value of each color component for all pixels. These values are then passed to the
next process, in which each pixel is checked to determine whether it satisfies the conditions
for being skin. The proportion of skin pixels is considered a primary attribute that affects the
resulting probability.
In the next step, after detecting skin regions, the algorithm calculates three important
features that aid in the classification: the number of skin regions, the size of the largest
regions and the relative positions of these regions. By combining the features obtained, the
probability is finally established.
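Locating skin regions amounts to labeling connected components in a binary skin mask. The sketch below is one possible way to do this with a breadth-first search; it assumes 8-connectivity (the text does not say whether diagonal neighbors count) and a mask given as a list of rows of 0/1 values. The function name `skin_regions` is hypothetical.

```python
from collections import deque


def skin_regions(mask):
    """Return connected skin regions as lists of (row, col), largest first.

    Assumption: pixels touching horizontally, vertically or diagonally
    belong to the same region (8-connectivity).
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Start a new region and flood it with a breadth-first search.
            seen[r][c] = True
            queue = deque([(r, c)])
            region = []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            regions.append(region)
    regions.sort(key=len, reverse=True)
    return regions


if __name__ == "__main__":
    mask = [
        [1, 1, 0, 0, 0],
        [1, 1, 0, 0, 1],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
    ]
    regions = skin_regions(mask)
    total = len(mask) * len(mask[0])
    p1 = sum(len(region) for region in regions) / total  # share of skin pixels
    p_max = len(regions[0]) / total                      # share of the largest region
    print(len(regions), [len(region) for region in regions], p1, p_max)
```

The number of regions, the sizes of the largest ones and their pixel coordinates are the raw material for the features described above.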
In summary, the modified algorithm works in the following manner:
1. Retrieve important features
- Scan the image, starting from the upper left corner to the lower right corner.
- Retrieve values for the red, green and blue components in RGB color space.
- From the RGB component values, calculate the corresponding values in the two other color
spaces (IHLS and HSV).
- Label each pixel as skin or non-skin, based on three conditions (1, 2, 3).
- Calculate the proportion of skin pixels in the image. The percentage obtained is
presented as p1.
- Locate the skin regions by identifying connected skin pixels.
- Obtain the number of skin regions and get the three largest skin regions: R1, R2, R3.
- Calculate the percentage of the largest skin region area relative to the image size,
presented as Pmax.
- Construct three quadrangles, one for each of the three largest skin regions, using the
four corner pixels of the region: the leftmost, the uppermost, the rightmost, and the
lowermost skin pixels.
- Calculate the areas of the three quadrangles, S1, S2 and S3 respectively.
- Calculate the number and the percentage of skin pixels within each quadrangle
relative to its area, pr1, pr2 and pr3 respectively.
- Calculate the average intensity of the pixels inside the quadrangles, presented as A.
2. Assign a probability p to the image as follows:
- If (p1 < 15%), assign p = p1. Otherwise, go to the next step.
- If (pr1 < 35%) and (pr2 < 35%) and (pr3 < 30%), then p = (pr1 + pr2 + pr3)/3.
- If (pr1 < 45%), then p = pr1/3.
- If (p1 < 30%) and ((pr1 + pr2 + pr3)/3 < 55%), then p = p1*(pr1 + pr2 + pr3)/3.
- If (p1 > 60%) and (A < 0.25), then p = A * p1.
- If (p1 > 60%) and (A > 0.4), then p = (A + p1)/2.
- Otherwise, p = (2*pr1 + pr2 + pr3 + A)/5.
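The assignment rules can be read as a cascade in which the first matching rule determines p. The sketch below follows that reading, which is an assumption, since the text does not state whether later rules may override earlier ones. It also assumes that all percentages are passed as fractions in [0, 1] and that A is an intensity normalized to [0, 1]; the function name is hypothetical.

```python
def nudity_probability(p1, pr1, pr2, pr3, a):
    """Assign a nudity probability from the extracted features.

    p1            -- fraction of skin pixels in the image
    pr1, pr2, pr3 -- fraction of skin pixels inside each quadrangle
    a             -- average intensity inside the quadrangles (assumed 0..1)
    Rules are tried top to bottom; the first match wins (assumption).
    """
    mean_pr = (pr1 + pr2 + pr3) / 3
    if p1 < 0.15:
        return p1
    if pr1 < 0.35 and pr2 < 0.35 and pr3 < 0.30:
        return mean_pr
    if pr1 < 0.45:
        return pr1 / 3
    if p1 < 0.30 and mean_pr < 0.55:
        return p1 * mean_pr
    if p1 > 0.60 and a < 0.25:
        return a * p1
    if p1 > 0.60 and a > 0.4:
        return (a + p1) / 2
    return (2 * pr1 + pr2 + pr3 + a) / 5


if __name__ == "__main__":
    # Little skin overall: the first rule applies and p equals p1.
    print(nudity_probability(0.10, 0.5, 0.4, 0.3, 0.5))
    # Much skin and a bright region: p is the mean of A and p1.
    print(nudity_probability(0.70, 0.8, 0.6, 0.5, 0.5))
    # No specific rule matches: the weighted fallback is used.
    print(nudity_probability(0.50, 0.8, 0.6, 0.4, 0.3))
```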
In our experiment, we determined the threshold values empirically. The probability
obtained could be more precise if we could detect the presence of body parts such as sexual
organs. Object detection, however, is still difficult, since it is not easy to distinguish
algorithmically between non-sexual body parts such as a hand or a face and sexual ones.
Nudity detection for large scale uploaded picture system
Our system (www.upanh.com) enables users to rapidly upload pictures at large scale. The
system allows users to retrieve 200 million pictures a day.
From the observation of pictures uploaded over the most recent 12 months, we saw an
important trend: most pornographic pictures are uploaded together in a group. In
other words, pornographic pictures are stored in our system in users' albums or uploaded
batches. Consequently, we apply the proposed algorithm to decide whether an image album is
nudity or not. To test the performance, we ran the algorithm on the albums uploaded to our
system during one day.
Generally, our proposed algorithm includes two main steps. First, for each group to
be examined, the nudity probability of each image is calculated using the first
algorithm. Then, we obtain the proportion of images that have a probability larger than
50%. If this proportion is larger than 70%, we consider the group as nudity. Otherwise, it is
non-nudity. We only consider image albums with more than 5,000 recorded views a day.
We consider the probability calculated for each picture as its point. In other words, a
picture with probability p% has p points. The algorithm works specifically as follows:
- Extract the set of pictures that have a view count of more than 5000 per day.
- Apply the algorithm to each picture to obtain its nudity probability.
- Extract the subset of pictures that have more than 30 points, denoted S.
- For each picture in S, trace back to identify the album or group containing it. Identify
the set of albums or groups that contain the pictures of S, denoted T. (Since an item in T
may include many pictures in S, the cardinality of S is larger than or equal to the
cardinality of T.)
- Calculate the point of every picture included in each item of T.
- Calculate the average point of each item in T (the average point of an album is the
average over all pictures it includes).
- For each album or group in T, if its average point is more than 50, mark all the pictures
in this album as nudity.
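The album-level steps listed above can be sketched as follows. The data layout (dictionaries keyed by picture and album identifiers), the function name and the precomputed `points` mapping are assumptions made for illustration; in a real system the points would come from the per-picture probability algorithm.

```python
def flag_nude_albums(albums, views, points,
                     min_views=5000, picture_threshold=30, album_threshold=50):
    """Return the set of album ids whose average point exceeds album_threshold.

    albums -- dict: album id -> list of picture ids (hypothetical layout)
    views  -- dict: picture id -> view count per day
    points -- dict: picture id -> point in 0..100 from the per-picture algorithm
    """
    album_of = {pic: album for album, pics in albums.items() for pic in pics}

    # S: frequently viewed pictures whose point exceeds the picture threshold.
    suspicious = {
        pic for pic, count in views.items()
        if count > min_views and points[pic] > picture_threshold
    }

    # T: albums that contain at least one picture of S.
    candidates = {album_of[pic] for pic in suspicious}

    flagged = set()
    for album in candidates:
        pics = albums[album]
        average = sum(points[pic] for pic in pics) / len(pics)
        if average > album_threshold:
            flagged.add(album)
    return flagged


if __name__ == "__main__":
    albums = {"a": ["a1", "a2", "a3"], "b": ["b1", "b2"], "c": ["c1"]}
    views = {"a1": 9000, "a2": 100, "a3": 200, "b1": 8000, "b2": 7000, "c1": 50}
    points = {"a1": 80, "a2": 70, "a3": 60, "b1": 35, "b2": 10, "c1": 95}
    print(flag_nude_albums(albums, views, points))
```

In this toy data, album "c" is never examined because none of its pictures passes the view-count filter, which mirrors the restriction to frequently viewed pictures in the first step.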
The threshold we used in this algorithm is also empirically determined.
Results, Discussion and Conclusion:
- Performance of probability calculation algorithm
To estimate the performance of the first algorithm, we applied it to a test set of 10,000
pornographic pictures. Experimental results suggest that the algorithm achieves high
precision on nude images: 98.77% of the nude images are tagged with a probability of more
than 50, and only 19 items are tagged with fewer than 30 points.
The result is shown in Figure 1.
Figure 1. Performance of probability calculation algorithm
- Performance of the algorithm for upload system in large scale
The experiment goes through 8 steps. First, we extract the set of 20,570 images that
have a view count of more than 5,000 per day (corresponding to 19.8% of the pictures uploaded
to our system). The second step applies the probability algorithm to extract the set of
pictures with more than 40 points. As a result, 90.11% of the pictures have fewer than
40 points; the remaining 2,036 images (9.89%) are considered doubtful, that is, possibly nude.
The proportions of the sets are presented in Figure 2.
At step 4, we trace back to identify the album or group in which each picture is stored. In
our experiment, from the 2,036 pictures retrieved at step 3, we determine the set of albums and
groups that contain all of them. The set includes 498 albums, which contain 5127 pictures
(corresponding to 24.90% of the pictures in the set).
The next step is to examine all pictures in the album set T. Theoretically, the number
of images stored in the whole of T can be bigger than the number of pictures retrieved in
step 1. We continue to use the probability algorithm to calculate the point of each picture in
the set T and the average point of each album. The result is shown in Figure 3 and Figure 4.
The last procedure is to tag all albums with a high average point as nudity.
Empirically, we use a threshold of 50 points to detect pornographic albums. The test set (20,570
images) was also manually classified into nudity and non-nudity categories, with the result that
5353 images were categorized as pornography. True positives and false positives are used to
measure the performance of the proposed algorithm. On this image set, the algorithm achieved
a 95.23% recall and a 0.04% false positive rate. It correctly identified 5098 out of 5353
nude images.
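As a quick check of the arithmetic, recall is the number of correctly identified nude images divided by the number of images manually labeled as nude. The sketch below recomputes it from the two counts reported above; the function names are hypothetical, and the false positive rate helper is shown only for completeness, since the raw false positive count is not given in the text.

```python
def recall(true_positives, actual_positives):
    """Fraction of actual positives that were detected."""
    if actual_positives <= 0:
        raise ValueError("actual_positives must be positive")
    return true_positives / actual_positives


def false_positive_rate(false_positives, actual_negatives):
    """Fraction of actual negatives that were wrongly flagged."""
    if actual_negatives <= 0:
        raise ValueError("actual_negatives must be positive")
    return false_positives / actual_negatives


if __name__ == "__main__":
    # Counts reported in the text: 5098 detected out of 5353 nude images.
    print(f"recall = {recall(5098, 5353):.4f}")
```

The result is about 0.952, in line with the reported recall of 95.23%.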
False detections in our model come mainly from non-nude images uploaded in nude
albums. Conversely, no nude picture was uploaded in a non-nude album
in our test. Figure 2 shows the result of our test.
We were not able to test our work on image sets used by other researchers. Thus, the
comparison between our method and other algorithms is not based on the same image set.
However, it can be said that our model performs significant nudity filtering and can be
effectively applied to larger-scale image upload systems.
Figure 3. The distribution of album’s average points
Figure 4. Distribution of picture’s points in doubtful album
Pornography detection is important for the Internet, but there is still no effective
method for preventing improper access to such content.
In this paper, we designed a simple and effective scheme for this task. Our scheme
shows high overall accuracy and a low mis-detection rate. The scheme also leaves considerable
room for further development, since it is built on a probability algorithm. With more accurate
feature selection, we believe that the results can be better in the future.
Pornography detection depends heavily on domain knowledge about this kind of picture.
With the help of domain knowledge, we can generate more accurate attributes for detecting
pornography.
In the future, we will improve the scheme by observing the trends of users. The system
will enable people to upload images without signing in. We will also continue to study the
activity of the listed nudity websites that use our system as their main storage.
In addition, we will improve the accuracy of the probability algorithm by distinguishing
pornographic content in pictures of children from that in pictures of adults.
Acknowledgments: This work was funded by the Ministry of Information and Telecommunication,
Vietnam.
References:
Pedro Monteiro da Silva Eleuterio & Mateus de Castro Polastro, “Identification of
High-Resolution Images of Child and Adolescent Pornography at Crime Scenes”, The
International Journal of FORENSIC COMPUTER SCIENCE
Lin, Y., Tseng, H. & Fuh, C. Pornography Detection Using Support Vector Machine. 16th
IPPR Conference on Computer Vision, Graphics and Image Processing (CVGIP 2003).
Kinmen, ROC.
Y. Xu, B. Li, X. Xue, and H. Lu, “Region-based pornographic image detection,” IEEE 7th
Workshop on Multimedia Signal Processing (MMSP), pp. 1–4, November 2005.
Lee, J. Y., and Yoo, S. I. 2002. An elliptical boundary model for skin color detection. In
Proc. of the 2002, International Conference on Imaging Science, Systems, and
Technology.
X. Shen, W. Wei, and Q. Qian, “The filtering of internet images based on detecting
erotogenic-part,” in Proceedings of the Third International Conference on Natural
Computation (ICNC). Washington, DC, USA: IEEE Computer Society, 2007, pp. 732–736.
Jae Y. Lee and Suk I. Yoo, “An Elliptical Boundary Model for Skin Color Detection”, School
of Computer Science and Engineering, Seoul National University Shilim-Dong,
Gwanak-Gu, Seoul 151-742, Korea
H. Zhu, S. Zhou, J. Wang, and Z. Yin, “An algorithm of pornographic image detection,” in
Proceedings of the Fourth International Conference on Image and Graphics (ICIG).
Washington, USA: IEEE Computer Society, 2007, pp. 801–804.
J.-L. Shih, C.-H. Lee, and C.-S. Yang, “An adult image identification system employing
image retrieval technique,” Pattern Recognition Letters, vol. 28, no. 16, pp. 2367–2374,
2007.
M. J. Jones and J. M. Rehg, “Statistical color models with application to skin detection,”
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), vol. 1, pp. 1274–1280, 1999.
W. Zeng, W. Gao, T. Zhang, and Y. Liu, “Image guarder: An intelligent detector for adult
images,” in Asian Conference on Computer Vision, Jeju Island, Korea, January 2004,
pp. 198–203.
H. Zheng, H. Liu, and M. Daoudi, “Blocking objectionable images: adult images and harmful
symbols,” in Proceedings of the IEEE International Conference on Multimedia and
Expo (ICME), June 2004, pp. 1223–1226.
Q.-F. Zheng, W. Zeng, G. Wen, and W.-Q. Wang, “Shape-based adult image detection,” in
Proceedings of the Third International Conference on Image and Graphics (ICIG).
Washington, DC, USA: IEEE Computer Society, 2004, pp. 150–153.
Elgammal, A.; Muang, C.; Hu, D. Skin Detection - a Short Tutorial. Encyclopedia of
Biometrics, Verlag, 2009.
Jones, M. J., Rehg, J. M. Statistical color models with application to skin detection.
International. Journal of Computer Vision (IJCV), pages 81–96, 2002.
V. A. Oliveira and A. Conci, “Skin Detection using HSV color space”.
Wang, Y.; Yuan, B. A novel approach for human face detection from color images under
complex background. Pattern Recognition 34, pages 1983–1992, 2001.
Rigan Ap-apid, An Algorithm for Nudity Detection, College of Computer Studies De La
Salle University Manila, Philippines.
Hanbury, A and Serra, J. A 3D-polar Coordinate Colour Representation Suitable for Image
Analysis. Vienna University of Technology (2004).