Towards Hands-Free Interfaces Based on Real-Time Robust Facial Gesture Recognition

Cristina Manresa-Yee, Javier Varona, and Francisco J. Perales
Universitat de les Illes Balears, Departament de Matemàtiques i Informàtica
Ed. Anselm Turmeda, Crta. Valldemossa km. 7.5, 07122 Palma
{cristina.manresa, xavi.varona, paco.perales}@uib.es

Abstract. Perceptual user interfaces are becoming important nowadays because they offer a more natural interaction with the computer via speech recognition, haptics, computer vision techniques and so on. In this paper we present a visual-based interface (VBI) that analyzes the user's facial gestures and motion. The interface works in real time on images from a conventional webcam, so it must recognize gestures robustly in standard webcam-quality images. The system automatically finds the user's face and tracks it through time in order to recognize the gestures within the face region. A new information fusion procedure is proposed to combine the data produced by the computer vision algorithms, and its results are used to carry out a robust recognition process. Finally, we show how the system can replace a conventional mouse for human-computer interaction: the head's motion controls the mouse pointer, and eye-wink detection triggers the mouse events.

1 Introduction

The research of new human-computer interfaces has become a growing field in computer science, which aims at the development of more natural, intuitive, unobtrusive and efficient interfaces. This objective has led to the concept of Perceptual User Interfaces (PUIs), which are becoming very popular because they seek to make the user interface more natural and compelling by taking advantage of the ways in which people naturally interact with each other and with the world. PUIs can use speech and sound recognition (ASR) and generation (TTS), computer vision, graphical animation and visualization, language understanding, touch-based sensing and feedback (haptics), learning, user modeling and dialog management [18].

Of all the communication channels through which interface information can travel, computer vision provides a great deal of information that can be used for the detection and recognition of human actions and gestures, which can be analyzed and applied for interaction purposes. When the user sits in front of a computer equipped with a webcam, a very common device nowadays, the head and face can be assumed to be visible. Therefore, systems based on head or face feature detection and tracking, and on facial gesture or expression recognition, can become very effective human-computer interfaces. Of course, difficulties can arise from in-plane (tilted head, upside down) and out-of-plane (frontal view, side view) rotations of the head, facial hair, glasses, lighting variations and cluttered backgrounds [14]. Besides, when using standard USB webcams, the limited resolution of the provided CMOS images has to be taken into account.

Different approaches have been used for non-invasive face/head-based interfaces. For controlling the pointer position, some systems analyze facial cues such as color distributions, head geometry or motion [5, 17]. Other works track facial features [10, 3] or gaze, possibly using infrared lighting [13, 15].
To recognize the user's events, facial gesture recognition can be used. In this paper we consider as facial gestures the atomic facial feature motions, such as eye blinks [9, 11, 12], winks or mouth opening. Other systems address head gesture recognition, which involves overall head motions, or facial expression recognition, which combines changes of the mentioned facial features to express an emotion [8].

In this work we present a visual-based interface (VBI) that uses facial feature tracking and facial gesture recognition. To achieve this function, the system's feedback must be given in real time, and it must be precise and robust. A standard USB webcam provides the images to process, which allows a low-cost system. The last requirement is that the user's work environment should be a normal one (office, home or other indoor environments), that is, with no special lighting or static background.

The paper is organized as follows. In the next section we describe the system in general terms. Section 3 explains the learning process of the user's facial features. Then, in Section 4, we explain how the facial feature positions are estimated through time. The facial gesture recognition process for detecting eye winks is detailed in Section 5. Finally, in the last sections, a system application, a mouse replacement, is presented together with the overall conclusions of the work.

2 System Overview

To achieve an easy and friendly perceptual user interface, the system is composed of two main modules: Initialization and Processing (see Fig. 1). The Initialization module is responsible for extracting the user's distinctive facial features. This process locates the user's face, learns the skin color and detects the initial facial feature locations and their properties, such as appearance and color. Moreover, this process is completely automatic, and it can be considered a learning process of the user's facial features. The chosen facial features are the nose for head tracking and the eyes for gesture recognition. We decided to use the nose as the feature to track because it is almost always visible in all positions of the head facing the screen and it is not occluded by beards, moustaches or glasses [10]. For the gesture recognition module, the main gestures to detect are winks of the right or left eye.

The selected facial features' positions are robustly estimated through time by two tasks: nose tracking based on the Lucas-Kanade algorithm and eye tracking by means of color distributions. It is important to point out that the system is able to react when the features get lost, detecting when this occurs and restarting the system by calling the Initialization module again. Finally, there is the possibility of adding more gestures to the system if head motions are taken into account [20], in order to build a higher-level human-computer communication.

Fig. 1. The system is divided in two main modules: Initialization and Processing
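As an illustration, the control flow of Fig. 1 could be organized as in the following minimal Python sketch, assuming OpenCV's Python bindings for image capture. The stage functions (initialize_features, track_features, recognize_gesture) are hypothetical stubs standing in for the processes described in Sections 3-5; they are not part of the original implementation.

```python
import cv2

# Hypothetical stage stubs standing in for the processes of Sections 3-5.
def initialize_features(frame):
    return {"frame": frame}          # placeholder: face region, skin and eye models

def track_features(frame, features):
    return (0, 0), False             # placeholder: nose position and "lost" flag

def recognize_gesture(frame, features):
    return None                      # placeholder: "left", "right" or None

def run_interface(camera_index=0):
    """Main loop mirroring Fig. 1: Initialization followed by Processing."""
    cap = cv2.VideoCapture(camera_index)
    features = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if features is None:
            features = initialize_features(frame)   # Initialization module
            continue
        position, lost = track_features(frame, features)
        if lost:                                      # features lost: restart
            features = None
            continue
        event = recognize_gesture(frame, features)    # wink events
        # position and event would drive the mouse pointer (Section 6)
    cap.release()
```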
3 Learning the User's Facial Features

As remarked in the definition of PUIs, it is very important for the interface to be natural; consequently, the system should not require any calibration process in which the user has to intervene. To meet this requirement, the system detects the user's face automatically by means of a real-time face detection algorithm [19]. When the system is first executed, the user must stay still for a few frames so that the process can be initialized: face detection is considered reliable when the face region is detected without changes during a few consecutive frames (see Fig. 2 (a)). It is then possible to define the initial user's face region in which to start the search for the user's facial features.

Based on anthropometric measurements, the face region can be divided into three sections: the eyes and eyebrows region, the nose region, and the mouth region. Over the nose region we look for those points that can be easily tracked, that is, those whose derivative energy perpendicular to the prominent direction is above a threshold [16]. In theory this algorithm selects the nose corners or the nostrils. However, the ambient lighting can cause the selection of points that are not placed over the desired positions; this effect is clearly visible in Fig. 2 (b). Ideally, the selected points should lie on both sides of the nose and satisfy certain symmetry conditions. Therefore, the found features must be refined and re-selected taking symmetry constraints into account. Fig. 2 (c) shows the features that are kept because of their symmetry with respect to the vertical axis. This re-selection process yields the best features to track and contributes to the tracking robustness. Fig. 2 (d) illustrates the final point considered, that is, the mean point of all the selected features, which, thanks to the re-selection, is centered on the nose.

Fig. 2. (a) Automatic face detection. (b) Initial set of features. (c) Best feature selection using symmetry constraints. (d) Mean of all features: the nose point

The user's skin color is the next feature to be learnt. It helps the tracking and the gesture recognition by constraining the processing to the pixels classified inside the skin mask. In order to learn the skin color, the pixels inside the face region are used as color samples to build the learning set. A Gaussian model in the 3D RGB space is chosen to represent the skin color probability density function due to its good results in practical applications [1]. The values of the Gaussian model parameters (mean and covariance matrix) are computed from the sample set using standard maximum likelihood methods [4]. Once the model has been computed, the probability of a new pixel being skin can be evaluated in order to create a "skin mask" of the user's face, see Fig. 3.

Fig. 3. Skin masks for different users
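The skin-colour model described above can be sketched as follows, assuming NumPy. The function names, the covariance regularization term and the threshold value are illustrative assumptions; the paper thresholds the skin probability, and the equivalent Mahalanobis-distance test used here is just one convenient way to implement the same decision.

```python
import numpy as np

def learn_skin_model(face_pixels):
    """Fit a 3D Gaussian (mean, covariance) to the RGB samples of the face region."""
    samples = face_pixels.reshape(-1, 3).astype(np.float64)
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False) + 1e-6 * np.eye(3)   # regularize covariance
    return mean, np.linalg.inv(cov)

def skin_mask(image, mean, inv_cov, threshold=9.0):
    """Label a pixel as skin when its squared Mahalanobis distance to the model
    is small; the threshold is an illustrative value, not taken from the paper."""
    diff = image.reshape(-1, 3).astype(np.float64) - mean
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)        # squared distances
    return (d2 < threshold).reshape(image.shape[:2])
```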
The last step of the Initialization phase is to build the user's eye models. Using the eyes and eyebrows region found in the face detection phase, both eyes can be located. First, the region is binarized to find the dark zones, and then we keep the bounding boxes of the pair of blobs that are symmetrical and closest to the nose region. This way, neither the eyebrows nor the face borders are selected. An example of eye detection is shown in Fig. 4. In the next section, the eye tracking, which is based on the eye color distribution, is explained. This choice is justified by the fact that the eye color differs from the color of the other facial features (taking into account that the eye color distribution is composed of the sclera and iris colors). Thus, our system can be used by users with light (blue or green) or dark (black or brown) eyes. The eye models are obtained through histogramming techniques in the RGB color space applied to the pixels belonging to the detected eye regions.

Fig. 4. Example of eye detection: the selected blobs (in red) are symmetrical and are the closest ones to the nose region

The model histograms are produced with the function b(x_i), which assigns the color at pixel location x_i to the corresponding bin. In our experiments, the histograms are calculated in the RGB space using 16 x 16 x 16 bins. We increase the reliability of the color distribution by applying a weight to each bin value that depends on the distance between the pixel location and the eye center.

4 Facial Feature Tracking

The facial feature tracking process consists of two tasks: eye tracking and nose tracking. As stated before, eye tracking is based on the eye color distribution. Weighting the eye model by an isotropic kernel makes it possible to use a gradient-based optimization, such as the mean-shift algorithm, to search for each eye model in the new frame. Practical details and a discussion of the complete algorithm can be found in [6]. In our implementation this algorithm performs well and in real time. Small positional errors can occur; however, this is not important because the eye tracking results are only used to define the image regions where the gesture recognition process is performed. Besides, to add robustness to this process, we only consider as search region those pixels belonging to the skin mask.

The positional results that matter for our system are reported by the nose tracking algorithm, which uses the features selected in the Initialization process. In this case, the spatial intensity gradient information of the images is used to find the best image registration [2]. As mentioned before, for each frame the mean of all features is computed and defined as the nose position for that frame. The tracking algorithm is robust to rotation, scaling and shearing, so the user can move in a fairly unrestricted way. Nevertheless, lighting changes or fast movements can cause the loss or displacement of the tracked features. Since only the features beneath the nose region are of interest, a feature is discarded when the distance between that feature and the mean position (the nose position) is greater than a predefined value.

In theory, it would be possible to use a Kalman filter to smooth the positions. However, Kalman filters are not well suited to our case because they do not achieve good results with erratic movements such as head motion [7]. Therefore, our smoothing algorithm is based on the motion tendency of the nose positions (the head motion). A linear regression is applied to the nose positions tracked over a number of consecutive frames: the nose points of the last n frames are fitted to a line, and the nose motion is carried out along that line direction. To avoid discontinuities, the regression line is re-estimated with every new point that arrives. Several frames of the tracking sequences are shown in Fig. 5.

Fig. 5. Facial feature tracking results
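The smoothing step can be illustrated with a short sketch, assuming NumPy. The window size, the choice of a total-least-squares (principal-axis) line fit, and the projection of only the newest point onto the fitted line are illustrative assumptions; the paper only states that a linear regression is applied to the last n nose positions and that the motion is carried out along the line direction.

```python
import numpy as np
from collections import deque

class NoseSmoother:
    """Smooth tracked nose positions by projecting each new point onto a line
    fitted to the last n positions (n = 10 is an illustrative choice)."""

    def __init__(self, n=10):
        self.history = deque(maxlen=n)

    def update(self, point):
        self.history.append(np.asarray(point, dtype=np.float64))
        pts = np.stack(self.history)
        if len(pts) < 3:
            return pts[-1]                      # not enough points to fit a line yet
        mean = pts.mean(axis=0)
        # Line direction: principal axis of the recent positions
        # (a total-least-squares regression line).
        _, _, vt = np.linalg.svd(pts - mean)
        direction = vt[0]
        # Project the newest point onto the line through the mean.
        offset = pts[-1] - mean
        return mean + np.dot(offset, direction) * direction
```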
5 Gesture Recognition

The gestures considered in this work are eye winks. Most related works use high-quality images with good resolution in the eye regions; wink recognition in webcam-quality images, however, is difficult. Besides, this process depends on the user's head position. Therefore, our wink detection is based on a search for the iris contours: if the iris contours are detected in the image, the eye is considered open; if not, the eye is considered closed. It is important to point out that this process is robust because it is only carried out in the eye regions tracked by the mean-shift procedure described before.

The process starts by detecting the vertical contours in the image. To avoid false positives, the vertical contours are combined, through a logical operation, with a mask generated by thresholding the original image. Finally, we keep the two longest vertical edges of each eye region as the eye candidates. If these two vertical edges, which correspond to the edges of the iris, do not appear for a number of consecutive frames (imposed for gesture consistency), we assume that the eye is closed. The process for gesture recognition is illustrated in Fig. 6.

Fig. 6. Process for recognizing winks. The first row shows the process applied to open eyes; the second row shows the process over closed eyes. (a) Original image. (b) Vertical edges. (c) Iris contours

6 HeadDev

Using the techniques described in the previous sections, a functional perceptual interface has been implemented. The application, HeadDev, completely fulfills the functions of a standard mouse and replaces it by means of facial feature tracking and facial gesture recognition. A highlight of this system is its potential users. Since PUIs can help in e-Inclusion and e-Accessibility issues, the system can offer assistive technology for people with physical disabilities, helping them to lead more independent lives, and for any kind of audience it contributes to new and more powerful interaction experiences. Its use is therefore focused on users with physical limitations in their hands or arms, or with motion difficulties in the upper limbs, who cannot use a traditional mouse. Other uses serve entertainment and leisure purposes, such as computer games or the exploration of immersive 3D graphic worlds [5].

By means of the nose tracking process, HeadDev simulates the mouse's motion. The precision must be sufficient for moving the mouse cursor to the desired position. The mouse motion can be reproduced in two different ways: absolute and relative. In the absolute mode, the nose position would be mapped directly onto the screen, but this would require very accurate tracking, since a small tracking error in the image would be magnified on the screen. Therefore, we use relative motion to control the mouse, which is not so sensitive to the tracking accuracy, since the cursor is driven by the relative motion of the nose in the image. The relative mode yields smoother cursor movements because the tracking error is not magnified. Then, if n_t = (x_t, y_t) is the nose position tracked in frame t, the new mouse screen coordinates s_t are computed as

s_t = s_{t-1} + α (n_t − n_{t-1}),    (1)

where α is a predefined constant that depends on the screen size and translates image coordinates to screen coordinates.
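The relative cursor update of Eq. (1) is straightforward to implement; the sketch below assumes NumPy, and the class name, the value of α, the screen size, the initial cursor position and the clipping to the screen bounds are illustrative assumptions. The mapping of winks to click events simply mirrors the description in the text.

```python
import numpy as np

class RelativeCursor:
    """Relative pointer control following Eq. (1): s_t = s_{t-1} + alpha*(n_t - n_{t-1}).
    alpha and the screen size are illustrative values, not taken from the paper."""

    def __init__(self, screen_size=(1280, 1024), alpha=3.0):
        self.screen = np.array(screen_size, dtype=np.float64)
        self.alpha = alpha
        self.s = self.screen / 2.0          # start at the screen center
        self.prev_nose = None

    def update(self, nose):
        nose = np.asarray(nose, dtype=np.float64)
        if self.prev_nose is not None:
            self.s += self.alpha * (nose - self.prev_nose)
            self.s = np.clip(self.s, 0, self.screen - 1)   # keep the cursor on screen
        self.prev_nose = nose
        return tuple(self.s.astype(int))

def click_event(wink):
    """Map a detected wink ('left' or 'right') to the corresponding mouse button."""
    return {"left": "left_click", "right": "right_click"}.get(wink)
```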
The computed mouse screen coordinates are sent to the operating system as real mouse input, placing the cursor in the desired position. Finally, to generate the mouse's right and left click events, we use winks of the right and left eye respectively, detected by means of the gesture recognition process described above.

To evaluate the application's performance, HeadDev was tested by a set of 22 users; half of them had never used the application before, and the other half had previously trained with the interface for a short period. A 5 x 5 grid of points was presented on the computer screen, and the user had to try to click on every point; each point had a radius of 15 pixels. While the user performed the task, the distance between each click position and the nearest grid point was stored in order to study the accuracy. The error distance is the distance in pixels of the faulty clicks (clicks that were not performed on the targets). The performance evaluation results are summarized in Table 1.

Table 1. Summary of the performance evaluation

Users group    Recognized clicks    Mean distance of errors
Trained        97.3 %               2 pixels
Novel          85.9 %               5 pixels

The experiments confirm that continuous training results in higher user skill and, thus, better performance and accuracy in controlling the mouse position. Besides, it has to be taken into account that this test can produce some neck fatigue in some users; part of the errors when clicking on the point grid could be due to this reason.

7 Conclusions and Future Work

In this paper we have proposed a new combination of several computer vision techniques, some of which have been improved and enhanced to reach more stability and robustness in tracking and gesture recognition, and we have given numerical and visual results. In order to build reliable and robust perceptual user interfaces based on computer vision, certain practical constraints must be taken into account: the application must be robust enough to work in any environment and to use images from low-cost devices. The VBI system presented in this paper satisfies these constraints. As a system application, we have presented an interface that is able to replace the standard mouse motions and events. The system has already been tested by several disabled people (with cerebral palsy and physical disabilities) with encouraging results. Of course, further improvements remain to be made, including more gestures (equivalents of BLISS commands or other kinds of languages for disabled persons), sound (TTS and ASR) and adaptive learning capabilities for specific disabilities. Further enhancements are planned as future work, such as including a larger set of head and face gestures in order to support common computer interactions such as web surfing.

HeadDev for Microsoft Windows is available under a freeware license at http://www.tagrv.com. This will allow the application to be tested by users around the world, and we will be able to improve the results by analyzing their reports. In the near future, a Linux version will also be available.

Acknowledgements

This work has been subsidized by the national project TIN2004-07926 of the MCYT Spanish Government and by TAGrv S.L., Fundación Vodafone and Asociación de Integración de Discapacitados en Red. Javier Varona acknowledges the support of a Ramon y Cajal grant from the Spanish MEC.
References

1. Alexander, D.C., Buxton, B.F.: Statistical Modeling of Colour Data. International Journal of Computer Vision 44 (2001) 87–109
2. Baker, S., Matthews, I.: Lucas-Kanade 20 Years On: A Unifying Framework. International Journal of Computer Vision 56 (2004) 221–225
3. Betke, M., Gips, J., Fleming, P.: The Camera Mouse: Visual Tracking of Body Features to Provide Computer Access for People with Severe Disabilities. IEEE Transactions on Neural Systems and Rehabilitation Engineering 10 (2002)
4. Bishop, C.M.: Neural Networks for Pattern Recognition. Clarendon Press (1995)
5. Bradski, G.R.: Computer Vision Face Tracking as a Component of a Perceptual User Interface. In: Proceedings of the IEEE Workshop on Applications of Computer Vision (1998) 214–219
6. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based Object Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003) 564–577
7. Fagiani, C., Betke, M., Gips, J.: Evaluation of Tracking Methods for Human-Computer Interaction. In: Proceedings of the IEEE Workshop on Applications in Computer Vision (2002) 121–126
8. Fasel, B., Luettin, J.: Automatic Facial Expression Analysis: A Survey. Pattern Recognition 36 (2003) 259–275
9. Gorodnichy, D.O.: Towards Automatic Retrieval of Blink-Based Lexicon for Persons Suffered from Brain-Stem Injury Using Video Cameras. In: Proceedings of the IEEE Computer Vision and Pattern Recognition, Workshop on Face Processing in Video (2004)
10. Gorodnichy, D.O., Malik, S., Roth, G.: Nouse 'Use Your Nose as a Mouse' – a New Technology for Hands-Free Games and Interfaces. Image and Vision Computing 22 (2004) 931–942
11. Grauman, K., Betke, M., Gips, J., Bradski, G.: Communication via Eye Blinks: Detection and Duration Analysis in Real Time. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2001)
12. Grauman, K., Betke, M., Lombardi, J., Gips, J., Bradski, G.R.: Communication via Eye Blinks and Eyebrow Raises: Video-Based Human-Computer Interfaces. Universal Access in the Information Society 2 (2003) 359–373
13. EyeTech Quick Glance, http://www.eyetechds.com/qglance2.htm (2006)
14. Kölsch, M., Turk, M.: Perceptual Interfaces. In: Medioni, G., Kang, S.B. (eds.): Emerging Topics in Computer Vision. Prentice Hall (2005)
15. Morimoto, C., Mimica, M.: Eye Gaze Tracking Techniques for Interactive Applications. Computer Vision and Image Understanding 98 (2005) 4–24
16. Shi, J., Tomasi, C.: Good Features to Track. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (1994) 593–600
17. Toyama, K.: "Look, Ma – No Hands!" Hands-Free Cursor Control with Real-Time 3D Face Tracking. In: Proceedings of the Workshop on Perceptual User Interfaces (1998) 49–54
18. Turk, M., Robertson, G.: Perceptual User Interfaces. Communications of the ACM 43 (2000) 32–34
19. Viola, P., Jones, M.: Robust Real-Time Face Detection. International Journal of Computer Vision 57 (2004) 137–154
20. Zelinsky, A., Heinzmann, J.: Real-Time Visual Recognition of Facial Gestures for Human-Computer Interaction. In: Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (1996) 351–356