Rajamangala University of Technology Tawan-ok International Conference, Thailand, 29-31 May, 2013

A NOVEL METHOD FOR PORNOGRAPHY PICTURE DETECTION IN LARGE SCALED SYSTEM

Hoang Trung Nguyen¹, Minh Quang Nguyen²
¹ R/D Department, Naiscorp Joint Stock Company, Hanoi, Vietnam. *E-mail: trungnh@socbay.com
² Department of Software Engineering, Faculty of Information and Technology, Hanoi National University of Education, Hanoi, Vietnam.

Abstract: This paper presents a method for detecting pornography or nudity in color images for a large-scale upload system. Based on the observation that nude pictures are often uploaded in groups, we apply an algorithm that computes a nudity probability for each picture and then decide whether the picture group is pornographic or not. To calculate the probability of each picture, we use three color spaces to increase the precision of detection: RGB, HSV and IHLS. The skin regions are analyzed for clues indicating nudity or non-nudity, such as their sizes and relative distances from each other. The proposed pornography picture detection method for large-scale picture upload systems detects nudity with a 95.23% recall and a 0.04% false positive rate. It correctly identified 5098 out of 5353 nude images.

Keywords: nudity detection, uploaded pictures, color space, skin region, pornography detection.

Introduction: The purpose of our method is to detect pornographic images uploaded to our large storage system. With 200,000 users at present, this system holds more than 50 million pictures, and 100,000 pictures are added every day. Pornographic images contain large areas of skin, and therefore pattern recognition for skin color is the main direction of much research. Different mechanisms are used to detect skin by color content. Most existing work tries to detect nudity features from the content of individual pictures.
Lee and Yoo [7] used an elliptical boundary model to separate skin chrominance from non-skin chrominance. The nudity detection algorithm in [18] uses three color spaces to determine the skin regions in a picture. From the percentage of the three largest regions, the picture is classified as nudity or non-nudity. Basically, the decision is made from the probability, i.e., a picture with large areas of skin is classified as nudity. Almost all of these approaches are similar in that they detect nudity characteristics of a particular image without information about the other items uploaded together with it.

In this paper, we propose a new method to classify a picture as nudity or not, based on the group or album it belongs to and the probability of each item in that album. The method emerged from the observation that nude pictures are often uploaded in groups. We use thresholds to decide whether a picture group is nudity or not: the group is nudity if the number of pictures with a high nudity probability is larger than a threshold. Our work came from the common behavior of users in an upload system, where pornographic pictures often go together in groups or albums. Based on the images collected over 6 months, there is a trend that users often upload nude pictures together in groups of 10 to 50 images. Consequently, we apply a mechanism to detect nudity in groups. In contrast to other detection systems, where each image is classified on its own, we use an algorithm that assigns a probability to each image. Whether the group is nudity or not is then decided from the probabilities of the images in the group.

To determine the probability of a single image, we use a combination of three color spaces (RGB, HSV and IHLS) and apply an algorithm for nudity detection [16]. The algorithm handles three image formats: JPEG, BMP and PNG. It can, however, easily be extended to classify images in other formats.
Methodology:

Skin detection

The first basic step is to determine the presence of objects that are likely to be pornographic. This requires an analysis of the image contents, including color, filename, dimensions, texture and the shapes of the objects presented [1], [6], [17]. Since there is a strong connection between the proportion of skin area and nudity, we use the color information and extract some clues from it. Basically, an image is represented as a matrix of pixels, where each pixel is a point with a specific color. There are several ways to express colors, depending on the choice of color space. So far, there is no clear evidence as to which color space is best for a detection system [8]. To improve the precision of skin detection, we apply three color spaces as follows:

- The RGB Color Space: The red-green-blue (RGB) color space is a traditional color model for computer graphics, originally developed for cathode ray tubes (CRTs). To specify a color, the RGB color space uses a combination of three values, corresponding to three components: red, green and blue. Each component ranges from 0 to 255. Since brightness and color are mixed in RGB, it is not well suited to color segmentation in images taken without controlled lighting conditions [1].
- HLS and HSV: The HLS (hue, lightness and saturation) and HSV (hue, saturation and value) color spaces are commonly used in image processing. The base color is chosen as an angle in a circular disc. The saturation is given by the distance from the origin: the saturation increases with the distance from the origin. This can be represented as a two-dimensional circular map, but introducing the lightness requires a third dimension. This is done by adding an axis perpendicular to the hue and saturation plane, i.e. a Z-axis. The model is called a 3D-polar coordinate color space.
The components are given by the equations below.

Some studies show that HSV is invariant to highlights under white light sources, to matte surfaces, and to ambient lighting.

- The IHLS: IHLS is a modified color space presented in [19], where the I before HLS stands for "improved". The idea is to keep the properties of the HLS color space, but with one significant difference. In HLS and HSV, the saturation depends on the luminance. In IHLS, the saturation is defined so that it does not depend on the value of the luminance [19, pp. 2-8]. Saturation definitions for HLS, HSV and IHLS are shown below:

- Skin detection algorithm

Human skin color, which depends mainly on hemoglobin and melanin concentrations and on lighting conditions [16], is not distributed randomly within a color space. Skin colors are grouped within small areas of a color space, which vary depending on the color space used [15]. Many studies have sought to develop methods for the classification of skin-color pixels using existing color spaces. One proposed method [18] for the detection of skin color in the RGB color space uses simple rules to rapidly construct skin-color pixel classifiers [12]. This method uses a basic arithmetical formula to describe the relationship between the three components of an RGB color:

R ≥ 95, G ≥ 40, B > 20,
max{R, G, B} - min{R, G, B} > 15,
|R - G| > 15, R > G, R > B.

Other color spaces have also been used to define skin-color pixel classifiers. Areas of color space belonging to the skin-color set have been defined using the HLS, HSV and IHLS color spaces [18]:

For the IHLS color space, we built a static filter from the skin distribution in Weka [4] and refined the corresponding values on test images.
Finally, the following rules are adopted [19]: iHmax = 50, iHmin = 0, iSmax = 0.9, iSmin = 0.1, where iHmax and iHmin are the upper and lower bounds for the hue component, and iSmax and iSmin are the upper and lower bounds for the saturation component of the IHLS color space.

Nudity probability calculation

The algorithm used to calculate the nudity probability is primarily based on color content. More specifically, a picture with a larger area of skin should be assigned a higher probability. In addition, we observe that pornographic pictures contain many skin tones, and that the skin regions in such a picture are often close to each other. The algorithm uses these features as the main clues for deciding the nudity probability of a picture.

The algorithm used in our system is modified from [19]. It consists of recognizing skin pixels in the image, locating skin regions, analyzing these regions to obtain several clues, and then assigning a probability to each picture. In the first step, to classify a pixel as skin or non-skin, the algorithm requires a skin distribution model. It scans each picture from the upper left corner to the lower right corner, retrieving the value of each color component for every pixel. These values are passed to the next stage, in which each pixel is checked against the conditions for being skin. The proportion of skin pixels is the primary attribute affecting the resulting probability. In the next step, after detecting the skin regions, the algorithm calculates three important features that aid the classification: the number of skin regions, the sizes of the largest regions and the relative positions of these regions. The probability is finally established by combining these features. In summary, the modified algorithm works as follows:

1. Retrieve important features
- Scan the image from the upper left corner to the lower right corner.
- Retrieve the values of the red, green and blue components in the RGB color space.
- From the RGB component values, calculate the values in the two other color spaces (IHLS and HSV).
- Label each pixel as skin or non-skin, based on the three conditions (1, 2, 3).
- Calculate the proportion of skin pixels in the image. The percentage obtained is denoted p1.
- Locate the skin regions by identifying connected skin pixels.
- Obtain the number of skin regions and the three largest skin regions: R1, R2, R3.
- Calculate the percentage of the largest skin region's area relative to the image size, denoted Pmax.
- Construct three quadrangles, one for each of the three largest skin regions, whose four corners are the leftmost, the uppermost, the rightmost and the lowermost skin pixels of the region.
- Calculate the areas of the three quadrangles, S1, S2 and S3 respectively.
- Calculate the number and the percentage of skin pixels within each quadrangle relative to its area, pr1, pr2 and pr3 respectively.
- Calculate the average intensity of the pixels inside the quadrangles, denoted A.

2. Assign a probability p to the image as follows
- If (p1 < 15%), then p = p1. Otherwise, go to the next step.
- If (pr1 < 35%) and (pr2 < 35%) and (pr3 < 30%), then p = (pr1 + pr2 + pr3)/3.
- If (pr1 < 45%), then p = pr1/3.
- If (p1 < 30%) and (pr1 + pr2 + pr3)/3 < 55%, then p = p1*(pr1 + pr2 + pr3)/3.
- If (p1 > 60%) and (A < 0.25), then p = A * p1.
- If (p1 > 60%) and (A > 0.4), then p = (A + p1)/2.
- Otherwise, p = (2*pr1 + pr2 + pr3 + A)/5.

In our experiment, we determined the threshold values empirically. The probability obtained could be more precise if we could detect the presence of body parts such as sexual organs. Object detection, however, is still difficult, since it is not easy to distinguish algorithmically between non-sexual body parts such as a hand or a face and sexual ones.
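As an illustration, the RGB skin rule quoted earlier and the probability assignment rules above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not our production implementation: the function names is_skin_rgb and assign_probability are hypothetical, percentages are expressed as fractions in [0, 1], the average intensity A is assumed to be normalized to [0, 1], and the rules are read as a cascade in which the first matching rule decides p. The HSV and IHLS conditions, region labelling and quadrangle construction are left out.

```python
def is_skin_rgb(r: int, g: int, b: int) -> bool:
    """RGB skin rule as quoted in the text (8-bit components, 0-255)."""
    return (
        r >= 95 and g >= 40 and b > 20
        and max(r, g, b) - min(r, g, b) > 15
        and abs(r - g) > 15
        and r > g and r > b
    )


def assign_probability(p1: float, pr1: float, pr2: float, pr3: float, a: float) -> float:
    """Assign a nudity probability from the extracted features.

    p1          -- fraction of skin pixels in the whole image
    pr1..pr3    -- fraction of skin pixels inside each of the three quadrangles
    a           -- average intensity inside the quadrangles (assumed in [0, 1])

    Assumption: the rules are tried in order and the first match wins.
    """
    mean_pr = (pr1 + pr2 + pr3) / 3
    if p1 < 0.15:
        return p1
    if pr1 < 0.35 and pr2 < 0.35 and pr3 < 0.30:
        return mean_pr
    if pr1 < 0.45:
        return pr1 / 3
    if p1 < 0.30 and mean_pr < 0.55:
        return p1 * mean_pr
    if p1 > 0.60 and a < 0.25:
        return a * p1
    if p1 > 0.60 and a > 0.4:
        return (a + p1) / 2
    return (2 * pr1 + pr2 + pr3 + a) / 5
```

Reading the rules as a first-match cascade is only one possible interpretation; if several rules were meant to apply together, the order and combination would have to be adjusted. Multiplying the returned fraction by 100 gives a value on the 0 to 100 point scale.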
Nudity detection for large scale uploaded picture system

Our system (www.upanh.com) enables users to rapidly upload pictures on a large scale. The system allows users to retrieve 200 million pictures a day. From our observation of the pictures uploaded over the last 12 months, we saw an important trend: almost all pornographic pictures are uploaded together in a group. In other words, pornographic pictures are stored in our system in users' albums or uploaded batches. Consequently, we propose an algorithm to detect whether an image album is nudity or not. To test its performance, we ran the algorithm on the albums uploaded to our system during one day.

Generally, our proposed algorithm includes two main steps. First, for each group to be checked, the nudity probability of each image is calculated with the first algorithm. Then, we obtain the proportion of images whose probability is larger than 50%. If this proportion is larger than 70%, we consider the group as nudity. Otherwise, it is non-nudity. We only consider image albums with more than 5,000 recorded views a day. We treat the probability calculated for each picture as its point. In other words, a picture with probability p% has p points. The algorithm works as follows:

- Extract the set of pictures that have a view count of more than 5,000 per day.
- Apply the algorithm to each picture to obtain its nudity probability.
- Extract the subset of pictures that have more than 30 points, denoted S.
- For each picture in S, trace back to identify the album or group containing it. The set of albums or groups that contain the pictures of S is denoted T. (Since an item in T may include many pictures in S, the cardinality of S is greater than or equal to the cardinality of T.)
- Calculate the points of all pictures included in each item of T. Calculate the average point of each item in T.
(The average point of an album is the average over all the pictures it contains.)
- For each album or group in T, if its average point is more than 50, classify all the pictures in this album as nudity.

The thresholds used in this algorithm are also determined empirically.

Results, Discussion and Conclusion:

- Performance of probability calculation algorithm

To estimate the performance of the first algorithm, we applied it to a test set of 10,000 pornographic pictures. Experimental results suggest that the algorithm achieves high precision on the nude image set, with 98.77% of the nude images tagged with a probability of more than 50. Only 19 items are tagged with fewer than 30 points. The result is shown in Figure 1.

Figure 1. Performance of probability calculation algorithm

- Performance of the algorithm for upload system in large scale

The experiment goes through 8 steps. First, we extract the set of 20,570 images that have a view count of more than 5,000 per day (corresponding to 19.8% of the pictures uploaded to our system). The second step is to apply the probability algorithm to extract the set of pictures with more than 40 points. As a result, we saw that 90.11% of the pictures have fewer than 40 points; the remaining 2,036 images (9.89%) are considered doubtful, i.e. possibly nudity. The proportions are presented in Figure 2.

Figure 2.
The proportion of points calculated

At step 4, we trace back to identify the album or group in which each picture is stored. In our experiment, from the 2,036 pictures retrieved at step 3, we determine the set of albums and groups that contain all of them. The set includes 498 albums, which contain 5127 pictures (corresponding to 24.90% of the pictures in the set). The next step is to examine all pictures in the album set T. Theoretically, the number of images stored in the whole of T is larger than the number of pictures retrieved in step 1. We again use the probability algorithm to calculate the point of each picture in T and the average point of each album. The results are shown in Figure 3 and Figure 4. The last step is to tag all the albums with a high average point as nudity. Empirically, we use a threshold of 50 points to detect pornographic albums.

The test set (20,570 images) was also manually classified into nudity and non-nudity categories, with 5353 images categorized as pornography. True positives and false positives are used to measure the performance of the proposed algorithm. On this image set, the algorithm achieved a 95.23% recall and a 0.04% false positive rate. It correctly identified 5098 out of 5353 nude images. False detections in our model come mainly from non-nude images uploaded in nude albums. Conversely, in our test we saw no nude picture uploaded in a non-nude album. Figure 2 shows the result of our test.

We were not able to test our method on image sets used by other researchers. Thus, any comparison between our method and other algorithms is not based on the same image set. However, our model filters nudity effectively and can be applied to larger-scale image upload systems.

Figure 3. The distribution of albums' average points
Figure 4.
Distribution of pictures' points in doubtful albums

Pornography detection is important for the Internet, but there is still no effective method for preventing improper access to such material. In this paper, we designed a simple and effective scheme for this task. Our scheme shows high overall accuracy and a low false detection rate. The scheme also leaves considerable room for improvement, since it builds on a probability algorithm that can be refined. With more accurate feature selection, we believe that the results can be improved in the future.

Pornography detection depends heavily on domain knowledge about this kind of picture. With the help of domain knowledge, we can generate more accurate attributes for detecting pornography. In the future, we will improve the scheme by observing the trends in user behavior. The system will enable people to upload images without signing in. We will also continue to study changes in the list of nudity websites that use our system as their main storage. In addition, we will improve the accuracy of the probability algorithm by distinguishing pornographic content in child and adult pictures.

Acknowledgments: This paper is funded by the Ministry of Information and Telecommunication, Vietnam.

References:
Pedro Monteiro da Silva Eleuterio and Mateus de Castro Polastro, “Identification of High-Resolution Images of Child and Adolescent Pornography at Crime Scenes”, The International Journal of Forensic Computer Science.
Lin, Y., Tseng, H. and Fuh, C. Pornography Detection Using Support Vector Machine. 16th IPPR Conference on Computer Vision, Graphics and Image Processing (CVGIP 2003). Kinmen, ROC.
Y. Xu, B. Li, X. Xue, and H. Lu, “Region-based pornographic image detection,” IEEE 7th Workshop on Multimedia Signal Processing (MMSP), pp. 1–4, November 2005.
Lee, J. Y., and Yoo, S. I. 2002. An elliptical boundary model for skin color detection. In Proc.
of the 2002 International Conference on Imaging Science, Systems, and Technology.
X. Shen, W. Wei, and Q. Qian, “The filtering of Internet images based on detecting erotogenic-part,” in Proceedings of the Third International Conference on Natural Computation (ICNC). Washington, DC, USA: IEEE Computer Society, 2007, pp. 732–736.
Jae Y. Lee and Suk I. Yoo, “An Elliptical Boundary Model for Skin Color Detection”, School of Computer Science and Engineering, Seoul National University, Shilim-Dong, Gwanak-Gu, Seoul 151-742, Korea.
H. Zhu, S. Zhou, J. Wang, and Z. Yin, “An algorithm of pornographic image detection,” in Proceedings of the Fourth International Conference on Image and Graphics (ICIG). Washington, USA: IEEE Computer Society, 2007, pp. 801–804.
J.-L. Shih, C.-H. Lee, and C.-S. Yang, “An adult image identification system employing image retrieval technique,” Pattern Recognition Letters, vol. 28, no. 16, pp. 2367–2374, 2007.
M. J. Jones and J. M. Rehg, “Statistical color models with application to skin detection,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 1274–1280, 1999.
W. Zeng, W. Gao, T. Zhang, and Y. Liu, “Image guarder: An intelligent detector for adult images,” in Asian Conference on Computer Vision, Jeju Island, Korea, January 2004, pp. 198–203.
H. Zheng, H. Liu, and M. Daoudi, “Blocking objectionable images: adult images and harmful symbols,” in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), June 2004, pp. 1223–1226.
Q.-F. Zheng, W. Zeng, G. Wen, and W.-Q. Wang, “Shape-based adult image detection,” in Proceedings of the Third International Conference on Image and Graphics (ICIG). Washington, DC, USA: IEEE Computer Society, 2004, pp. 150–153.
Elgammal, A.; Muang, C.; Hu, D. Skin Detection - a Short Tutorial. Encyclopedia of Biometrics, Verlag, 2009.
Jones, M. J., Rehg, J. M. Statistical color models with application to skin detection. International
Journal of Computer Vision (IJCV), pages 81–96, 2002.
Skin Detection using HSV color space, V. A. Oliveira, A. Conci.
Wang, Y.; Yuan, B. A novel approach for human face detection from color images under complex background. Pattern Recognition 34, pages 1983–1992, 2001.
Rigan Ap-apid, An Algorithm for Nudity Detection, College of Computer Studies, De La Salle University, Manila, Philippines.
Hanbury, A. and Serra, J. A 3D-polar Coordinate Colour Representation Suitable for Image Analysis. Vienna University of Technology (2004).