Real-Time Optimization of Viola-Jones Face Detection for Mobile Platforms

Jianfeng Ren^1, Nasser Kehtarnavaz^1, and Leonardo Estevez^2
^1 Department of Electrical Engineering, University of Texas at Dallas; ^2 Wireless Terminal Business Unit, Texas Instruments

ABSTRACT

Face detection algorithms based on the Viola-Jones object detection approach are widely adopted in digital camera products. Due to the computational complexity of these algorithms, a hardware coprocessor is often used for their real-time operation. This paper discusses how to achieve a real-time software-based implementation of these algorithms on mobile devices that have relatively limited processing and memory capabilities. Various optimization techniques are discussed and an example implementation outcome on an actual mobile platform is presented.

Index Terms: Real-time face detection, mobile platform, software optimization.

1. INTRODUCTION

In the last few years, a considerable amount of work has been done on face detection, and many papers on the subject have appeared in the literature. The existing face detection algorithms may be divided into two main approaches. The first approach is based on the utilization of skin color [1]. Such algorithms require a color correction procedure to compensate for light source variations. The second approach is based on the utilization of facial features [2]. Although algorithms based on facial features are relatively more accurate, their computational complexity and memory requirements are quite high, and their practical real-time implementation on mobile devices has so far been achieved only via dedicated coprocessors. Most face detection algorithms run on PCs with relatively powerful CPUs and large memory sizes. However, when it comes to mobile devices, due to their relatively limited processing and memory capabilities, one cannot run computationally intensive image processing algorithms in real-time without performing appropriate software optimization.
Among the feature-based detection algorithms, the one based on the Viola-Jones object detection approach [3] has been shown to be the most robust to environmental lighting changes, and thus it has been implemented in hardware in digital camera products. In [4], the OpenCV version of the Viola-Jones face detection is provided. In this paper, we present a software-based implementation of the Viola-Jones face detection algorithm. Due to the computational complexity of a software-based solution, we have considered a number of optimization steps to be able to run this algorithm in real-time on resource-limited mobile devices. As an implementation example, the Texas Instruments OMAP platform is considered. This platform is the adopted engine in many modern mobile phones. The rest of the paper is organized as follows. An overview of the face detection algorithm using the Viola-Jones approach is provided in section 2. The software optimization steps are discussed in section 3. Experimental results and discussion are then stated in section 4 and the conclusions in section 5.

2. OVERVIEW OF VIOLA-JONES FACE DETECTION ALGORITHM

As per the Viola-Jones approach, in order to detect an object, a trained classifier based on the cascaded AdaBoost algorithm is applied across a number of subimages. The first stage of this algorithm involves training a cascaded classifier, which is then used for detecting faces during the detection stage. The training process consumes an enormous amount of time, e.g. hours to days on a modern PC. The OpenCV version of this algorithm provides the classifier parameters, which get written into an XML file. Here, we use the trained parameters previously reported in the OpenCV version. Due to lack of space, the details of the training process are not covered here; the interested reader is referred to [3] and [4]. For detection, a so-called integral image for the entire image frame is computed.
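The integral image is what makes the rectangular features cheap to evaluate: once it is built, the sum of any rectangle of pixels costs only four table lookups, independent of the rectangle's size. A minimal sketch in Python (the function and variable names are our own illustration, not the paper's code):

```python
def integral_image(img):
    """Build an integral image with one extra zero row and column:
    ii[y][x] holds the sum of img over rows 0..y-1 and columns 0..x-1.
    The extra zeros avoid boundary checks in rect_sum()."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(1, h + 1):
        row_sum = 0
        for x in range(1, w + 1):
            row_sum += img[y - 1][x - 1]
            ii[y][x] = ii[y - 1][x] + row_sum
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of the pixels in the rectangle [x0, x1) x [y0, y1),
    computed with four lookups into the integral image."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
```

Each rectangular feature is then a signed combination of two to four such `rect_sum` calls, so feature evaluation is O(1) per rectangle regardless of subimage size.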
Then, each subimage at different positions and sizes is tested against all trees/stages in the classifier. Figure 1 provides an overview illustration of the algorithm. First, the classifier parameters from the XML file are read into a data structure such as a binary tree or an array. In the implementation reported in this paper, we used the classifier parameters for frontal view faces. It should be mentioned that for profile faces or other face orientations, corresponding classifier parameters can be used. The classifier selected for frontal view faces consists of 22 stages, with each stage comprising different numbers of trees ranging from 3 to 212. For each subimage to be examined, its corresponding features are computed. Viola and Jones proposed four different rectangular features within a subimage, as shown in Figure 2. During the training process, the number of rectangular features within one 24x24 block is about 18,000. After training, each tree does the comparison for one rectangular feature. Therefore, during each stage, each tree is applied to the subimage under testing.

Figure 1: Viola-Jones face detection algorithm. (Each tree corresponds to one feature; a subimage passes a stage if its stage sum exceeds that stage's threshold, and a failed stage ends the testing with no face found.)

Applying a tree generates one value to be compared with the threshold of that tree. If the value is less than the threshold of that tree, the left value of the tree gets accumulated; otherwise, the right value gets accumulated. For each stage, if the stage sum is less than the stage threshold (T#, where # indicates the stage number appearing in Figure 1), then the testing ends, indicating that the tested subimage does not contain a face. Otherwise, the process continues through all the trees/stages until the last one.
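The per-subimage test described above can be sketched as follows. The data layout (a list of stages, each a list of trees plus a stage threshold) is a simplified stand-in for the OpenCV XML parameters, and all names here are our own:

```python
def evaluate_cascade(feature_value, stages):
    """Return True if the subimage passes every stage of the cascade.

    feature_value(feature) -- computes one rectangular feature on the
                              subimage under test
    stages -- list of (trees, stage_threshold); each tree is a tuple
              (feature, tree_threshold, left_val, right_val)
    """
    for trees, stage_threshold in stages:
        stage_sum = 0.0
        for feature, tree_threshold, left_val, right_val in trees:
            value = feature_value(feature)
            # Accumulate the left value when below the tree threshold,
            # the right value otherwise.
            stage_sum += left_val if value < tree_threshold else right_val
        if stage_sum < stage_threshold:
            return False  # early rejection: subimage contains no face
    return True  # passed all stages: subimage is declared a face
```

The early rejection in the inner loop is the key property of the cascade: most non-face subimages fail in the first few stages, so only rare face-like subimages pay the full cost of all 22 stages.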
If a subimage passes all the stages and the final result is 1, the subimage is a face.

3. REAL-TIME OPTIMIZATION OF VIOLA-JONES DETECTION

In what follows, we provide the software optimization steps that we considered to allow the real-time implementation of the above face detection algorithm. These optimizations are mentioned in the order of their computational time reduction, with the step providing the most time reduction stated first. At this point, it is worth mentioning that, as stated in [6], these optimizations are general purpose in the sense that they can also be applied to other computationally intensive image processing algorithms that are desired to be run in real-time on mobile platforms.

3.1 Optimization A - Data reduction

i. Spatial subsampling: By spatially subsampling images, much computation can be saved due to data reduction. In our implementation, we started with VGA resolution images for captured video and reduced the size to QVGA for processing.

ii. Step size: In the original algorithm, each subimage is shifted one pixel at a time. Additional data reduction was achieved by shifting the subimages by two pixels.

iii. Scale size: The original subimage size is 20x20 with a scale factor of 1.1. That is to say, subimages of size 20x20 are used to scan the entire image from left to right and from top to bottom during the first round. During the next round, the size of the subimages is increased to 22x22. For further reduction of data, we increased the scale factor to 1.3 in our implementation.

iv. Minimum face size: By defining a minimum face size, one can stop the detection when face sizes lose their practical significance. In our implementation, we considered a minimum face size of 30x30.

Figure 2: Viola-Jones rectangular features for a tested subimage.

Figure 3: Real-time optimization steps for Viola-Jones face detection on mobile platforms.
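Steps ii through iv above determine how many subwindows are examined per frame. The scanning loop can be sketched as the following generator, with the step size, scale factor, and minimum face size exposed as parameters (this helper and its defaults are our own illustration of the parameters discussed in Section 3.1):

```python
def scan_windows(img_w, img_h, min_size=30, scale=1.3, step=2):
    """Yield (x, y, size) subwindows over an img_w x img_h frame:
    start at the minimum face size, shift the window by `step` pixels
    in each direction, and grow the window by `scale` per round."""
    size = min_size
    while size <= min(img_w, img_h):
        for y in range(0, img_h - size + 1, step):
            for x in range(0, img_w - size + 1, step):
                yield (x, y, size)
        size = int(size * scale)
```

With step=2 the number of window positions per round drops by roughly 4x compared to single-pixel shifts, and scale=1.3 roughly halves the number of rounds relative to scale=1.1, which is where the bulk of the data-reduction gain comes from.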
3.2 Optimization B - Search reduction

v. Utilization of a key frame and a narrowed detection area: To limit the amount of search, the concept of a key frame is introduced to synchronize face detection with face tracking. In our implementation, one frame was labeled as the key frame every 30 frames. Generally speaking, face detection takes more time than face tracking. However, if a detected face size is larger than 40x40, it is time consuming to perform face tracking; in such a situation, face tracking is avoided. Our face tracking is done using the SAD (Sum of Absolute Differences) approach. If faces are detected in key frames and, at the same time, the face size is larger than 40x40, then during subsequent frames the detection is done only within a surrounding area.

3.3 Optimization C - Numerical reduction

vi. Fixed-point processing: Noting that each subimage is checked over all the trees/stages, when one subimage is flagged as a face, it must go through 2135 trees. Each tree involves various summations, multiplications and divisions. Normally, these computations are done with numbers in the floating-point format. The great majority of mobile devices are fixed-point devices, and it is quite inefficient to perform floating-point computation on fixed-point processors. For this optimization, we redid all the computations using the fixed-point Q format.

In addition to the above optimizations, a display buffer was utilized to continuously draw a rectangular graphics overlay around the largest detected face. Figure 3 shows the optimization steps applied for the purpose of achieving a real-time throughput. These optimization steps can also be used for similar types of algorithms. As seen in Figure 3, initially the tracking is disabled. Then, it is checked whether the current frame is a key frame or not. If the current frame is a key frame, the entire frame is examined based on the Viola-Jones approach.
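The fixed-point conversion replaces each floating-point number by an integer scaled by 2^Q. A minimal sketch in Q16 (the specific Q value and the helper names are illustrative; the paper does not state which Q format was used):

```python
Q = 16  # illustrative Q16 format: a real x is stored as round(x * 2**16)

def to_q(x):
    """Convert a float to Q16 fixed point."""
    return int(round(x * (1 << Q)))

def from_q(a):
    """Convert Q16 back to float (for checking only)."""
    return a / (1 << Q)

def q_mul(a, b):
    """Fixed-point multiply: the raw product carries 2Q fractional
    bits, so shift right by Q to renormalize."""
    return (a * b) >> Q

def q_div(a, b):
    """Fixed-point divide: pre-shift the numerator by Q so the
    quotient keeps Q fractional bits."""
    return (a << Q) // b
```

On a fixed-point processor these operations map to plain integer multiplies, divides, and shifts, which is why replacing floating-point arithmetic in the tree evaluations yields the 3 to 5 times speedup reported in Section 4.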
If a face is detected in a key frame, tracking gets activated depending on the size of the detected face. If the face size is large, the face detection in the next frame is done within the surrounding area. If the face size is small, the tracking is done based on SAD.

4. REAL-TIME IMPLEMENTATION RESULTS

In this section, the above optimization steps are put to the test by performing an actual implementation on the Texas Instruments OMAP mobile platform. We selected this processor as it is widely adopted in many modern cell-phones. This processor is a triple-core engine consisting of an ARM Cortex-A8 processor, a graphics processor, and a C6400 DSP processor. Figure 4 shows a snapshot of the Viola-Jones face detection running in real-time on the OMAP3430 mobile device. As shown in [3], the detection accuracy is more than 99% for frontal view faces. As far as processing time is concerned, Table 1 lists the gain in processing time obtained by applying one set of optimizations, without the other sets, to four different video clips. From Table 1, it can be seen that data reduction with QVGA resolution reduces the processing time by about 90%. Another major reduction in processing time is due to fixed-point processing, generating about a 3 to 5 times speedup. The last contributor is the narrowed search and face tracking. Table 2 lists the processing time reduction in an incremental fashion for four different video clips, indicating an average processing rate of at least 15 frames per second. It is worth mentioning that, in general, in mobile systems, further speedup is gained by implementing computationally intensive algorithms, such as the algorithm discussed in this paper, on integrated coprocessors including DSPs.
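For small faces, the SAD tracking mentioned above amounts to a local search that minimizes the sum of absolute differences between the stored face block and candidate positions in the new frame. A sketch of this search (the search radius and all names here are our own assumptions, not the paper's implementation):

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def track_sad(frame, template, x0, y0, radius=8):
    """Search the (2*radius+1)^2 neighborhood around (x0, y0) in
    `frame` for the position whose block best matches `template`.
    Returns (x, y, best_sad)."""
    th, tw = len(template), len(template[0])
    fh, fw = len(frame), len(frame[0])
    best = (x0, y0, float('inf'))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = x0 + dx, y0 + dy
            if 0 <= x <= fw - tw and 0 <= y <= fh - th:
                block = [row[x:x + tw] for row in frame[y:y + th]]
                score = sad(block, template)
                if score < best[2]:
                    best = (x, y, score)
    return best
```

Because the search is confined to a small neighborhood of the previous face position, this is far cheaper than rescanning the frame, which is why tracking is preferred over detection on non-key frames.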
Figure 4: Snapshot of face detection running in real-time on the OMAP3430 mobile platform.

Table 1: Face detection time per frame averaged over 100 frames for the three sets of optimizations (in seconds).

                  Clip 1   Clip 2   Clip 3   Clip 4   Average
No optimization   158.35   148.51   169.07   165.05   158.64
A only              1.59     0.91     0.93     0.90     1.14
B only             29.90    28.49    30.31    28.68    29.57
C only             39.14    31.71    39.07    37.00    36.64

Table 2: Face detection time per frame averaged over 100 frames for the optimizations applied in an incremental fashion (in seconds).

                  Clip 1   Clip 2   Clip 3   Clip 4   Average
No optimization   158.35   148.51   169.07   165.05   158.64
i                  11.64    10.96    11.54    11.04    11.38
i & ii              2.92     2.80     2.93     2.82     2.88
i through iii       1.22     1.17     1.19     1.15     1.19
i through iv        0.97     0.91     0.93     0.90     0.94
i through v         0.87     0.26     0.31     0.29     0.48
i through vi        0.15     0.14     0.08     0.14     0.12

5. CONCLUSION

In this paper, various optimization steps are introduced in order to be able to run the popular and widely used Viola-Jones face detection algorithm in real-time on mobile devices. It is shown that by appropriately reducing the data and the amount of search, and by performing the computation in fixed-point, a real-time throughput can be achieved by merely taking a software approach, without using any dedicated hardware coprocessor.

6. ACKNOWLEDGEMENT

This work was sponsored by Texas Instruments. Special thanks to Mr. Shravan Suryanarayana for his assistance with the mobile platform and Dr. Umit Batur for the helpful discussions.

7. REFERENCES

[1] R. Hsu, M. Mottaleb and A. Jain, "Face Detection in Color Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 696-707, May 2002.
[2] K. Yow and R. Cipolla, "Feature-based Human Face Detection," Image and Vision Computing, vol. 15, no. 9, pp. 713-735, September 1997.
[3] P. Viola and M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features," Proc. IEEE CVPR, 2001.
[4] OpenCV [online]: http://www.intel.com/technology/computing/opencv/overview.htm
[5] B. Kisacanin, "Integral Image Optimizations for Embedded Vision Applications," Proceedings of the IEEE Southwest Symposium on Image Analysis and Interpretation, Santa Fe, March 2008.
[6] N. Kehtarnavaz and M. Gamadia, Real-Time Image and Video Processing: From Research to Reality, Morgan and Claypool Publishers, 2006.