Towards Hands-Free Interfaces Based on Real-Time
Robust Facial Gesture Recognition
Cristina Manresa-Yee, Javier Varona, and Francisco J. Perales
Universitat de les Illes Balears
Departament de Matemàtiques i Informàtica
Ed. Anselm Turmeda. Crta. Valldemossa km. 7.5 07122 Palma
{cristina.manresa, xavi.varona, paco.perales}@uib.es
Abstract. Perceptual user interfaces are becoming important nowadays because they offer a more natural interaction with the computer via speech recognition, haptics, computer vision techniques and so on. In this paper we present a visual-based interface (VBI) that analyzes the user's facial gestures and motion. The interface works in real time and takes its images from a conventional webcam; consequently, it has to recognize gestures robustly in standard webcam-quality images. The system automatically finds the user's face and tracks it through time in order to recognize gestures within the face region. A new information fusion procedure is proposed to combine the data acquired by the computer vision algorithms, and its results are used to carry out a robust recognition process. Finally, we show how the system can replace a conventional mouse for human-computer interaction: the head's motion controls the mouse's motion, and the detection of eye winks triggers the mouse's events.
1 Introduction
The research of new human-computer interfaces has become a growing field in computer science, which aims at the development of more natural, intuitive, unobtrusive and efficient interfaces. This objective has given rise to the concept of Perceptual User Interfaces (PUIs), which are turning out to be very popular as they seek to make the user interface more natural and compelling by taking advantage of the ways in which people naturally interact with each other and with the world. PUIs can use speech and sound recognition (ARS) and generation (TTS), computer vision, graphical animation and visualization, language understanding, touch-based sensing and feedback (haptics), learning, user modeling and dialog management [18]. Of all the communication channels through which interface information can travel, computer vision provides a wealth of information that can be used for the detection and recognition of human actions and gestures, which can be analyzed and applied for interaction purposes.
When sitting in front of a computer and with the use of webcams, very common devices nowadays, heads and faces can be assumed to be visible. Therefore, systems based on head or face feature detection and tracking, and on facial gesture or expression recognition, can become very effective human-computer interfaces. Of course, difficulties can arise from in-plane (tilted head, upside down) and out-of-plane (frontal view, side view) rotations of the head, facial hair, glasses, lighting variations and
a cluttered background [14]. Besides, when using standard USB webcams, the limited resolution of the provided CMOS images has to be taken into account.
Different approaches have been used for non-invasive face/head-based interfaces. For the control of the position, some systems analyze facial cues such as color distributions, head geometry or motion [5, 17]. Other works track facial features [10, 3] or gaze, including infrared lighting [13, 15]. To recognize the user's events, facial gesture recognition can be used. In this paper we consider as facial gestures the atomic facial feature motions, such as eye blinks [9, 11, 12], winks or mouth opening. Other systems address head gesture recognition, which involves overall head motions, or facial expression recognition, which combines changes of the mentioned facial features to express an emotion [8].
In this work, we present a visual-based interface (VBI) that uses face feature tracking and facial gesture recognition. In order to achieve this function, the system's feedback must be given in real time and it must be precise and robust. A standard USB webcam provides the images to process, which keeps the cost of the system low. Finally, the last requirement is that the user's work environment conditions should be normal (office, house or indoor environments), that is, with no special lighting or static background.
The paper is organized as follows. In the next section we describe the system in general terms. Section 3 explains the learning process of the user's facial features. Then, in Section 4, we explain how the facial feature positions are estimated through time. The facial gesture recognition process for detecting eye winks is detailed in Section 5. Finally, Section 6 presents a system application, a mouse replacement, and Section 7 draws the overall conclusions.
2 System Overview
To achieve an easy-to-use and friendly perceptual user interface, the system is composed of two main modules: Initialization and Processing (see Fig. 1). The Initialization module is responsible for extracting the user's distinctive facial features. This process locates the user's face, learns the user's skin color and detects the initial facial feature locations and their properties, such as appearance and color. Moreover, this process is completely automatic, and it can be considered a learning process of the user's facial features. The chosen facial features are the nose for head tracking and the eyes for gesture recognition. We decided to use the nose as the feature to track because it is almost always visible in all positions of the head facing the screen and it is not occluded by beards, moustaches or glasses [10]. For the gesture recognition module, the main gestures to detect are winks of the right or left eye.
The positions of the selected facial features are robustly estimated through time by two tasks: nose tracking based on the Lucas-Kanade algorithm, and eye tracking by means of color distributions. It is important to point out that the system is able to react when the features get lost, detecting when this occurs and restarting the system by calling the Initialization module again.
Finally, there is the possibility of adding more gestures to the system if head motions are taken into account [20], in order to build a higher-level human-computer communication.
Fig. 1. The system is divided into two main modules: Initialization and Processing
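As an illustrative sketch of this control flow, the following Python fragment (with hypothetical initialize, track and recognize_gesture placeholders standing in for the modules described in Sections 3-5) shows how Processing falls back to Initialization whenever the tracked features are lost.

```python
import cv2

# Hypothetical placeholders for the modules described in Sections 3-5.
def initialize(frame):                    # learn face, skin color and eye models
    ...
def track(frame, state):                  # return (nose_position, lost_flag)
    ...
def recognize_gesture(frame, state):      # return "left", "right" or None
    ...

def run_interface(camera_index=0):
    """Control flow of Fig. 1: Initialization feeds Processing, and a loss
    of the tracked features sends the system back to Initialization."""
    cap = cv2.VideoCapture(camera_index)
    state = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if state is None:
            state = initialize(frame)          # None until good features are found
            continue
        position, lost = track(frame, state)
        if lost:
            state = None                        # restart from Initialization
            continue
        event = recognize_gesture(frame, state) # e.g. a detected wink
        # position/event would be forwarded to the cursor-control layer here
    cap.release()
```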
3 Learning the User’s Facial Features
As remarked in the definition of PUIs, it is very important for the interface to be natural; consequently, the system should not require any calibration process in which the user has to intervene. To accomplish this, the system automatically detects the user's face by means of a real-time face detection algorithm [19].
When the system is first executed, the user must stay still for a few frames so that the process can be initialized. Face detection is considered robust when the face region is detected without changes during a few consecutive frames (see Fig. 2(a)). Then, it is possible to define the initial user's face region and to start the search for the user's facial features. Based on anthropometrical measurements, the face region can be divided into three sections: the eyes and eyebrows, the nose, and the mouth region.
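A minimal sketch of this step, assuming OpenCV's bundled Haar cascade as a stand-in for the real-time detector of [19]; the equal one-third split of the face box is an illustrative choice rather than the paper's exact anthropometrical proportions.

```python
import cv2

def face_regions(gray):
    """Detect the largest face and split it into eyes/eyebrows, nose and
    mouth bands (illustrative one-third split)."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])     # largest detection
    third = h // 3
    return {
        "face":  (x, y, w, h),
        "eyes":  (x, y,             w, third),   # eyes and eyebrows band
        "nose":  (x, y + third,     w, third),   # nose band
        "mouth": (x, y + 2 * third, w, third),   # mouth band
    }
```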
Over the nose region, we look for those points that can be easily tracked, that is, those whose derivative energy perpendicular to the prominent direction is above a threshold [16]. In theory, this algorithm selects the nose corners or the nostrils. In practice, however, the ambient lighting can cause the selection of points that are not placed over the desired positions; this fact is clearly visible in Fig. 2(b). Ideally, the selected points should lie at both sides of the nose and satisfy certain symmetry conditions. Therefore, an enhancement and a re-selection of the found features must be carried out, taking symmetry constraints into account. Fig. 2(c) shows the selected features that we keep due to their symmetry with respect to the vertical axis. This re-selection process retains the best features to track and contributes to the tracking robustness. Fig. 2(d) illustrates the final point considered, that is, the mean point of all the selected features, which, due to the re-selection, is centered on the nose.
Fig. 2. (a) Automatic face detection. (b) Initial set of features. (c) Best feature selection using symmetrical constraints. (d) Mean of all features: nose point.
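The following sketch illustrates the feature selection and symmetric re-selection, using OpenCV's Shi-Tomasi detector [16]; the mirror-pair symmetry test and its tolerance are a simplified reading of the re-selection described above, not the paper's exact rule.

```python
import cv2
import numpy as np

def nose_point(gray, nose_rect, tol=4.0):
    """Select good features in the nose band, keep roughly symmetric pairs
    about the vertical centre axis, and return their mean as the nose point."""
    x, y, w, h = nose_rect
    roi = gray[y:y + h, x:x + w]
    pts = cv2.goodFeaturesToTrack(roi, maxCorners=20,
                                  qualityLevel=0.01, minDistance=3)
    if pts is None:
        return None, None
    pts = pts.reshape(-1, 2) + np.array([x, y], dtype=np.float32)
    cx = x + w / 2.0                                  # vertical symmetry axis
    keep = []
    for p in pts:
        mirror = np.array([2 * cx - p[0], p[1]])      # reflection of p
        # keep p if some other feature lies close to its mirror image
        if any(np.linalg.norm(q - mirror) < tol
               for q in pts if not np.array_equal(q, p)):
            keep.append(p)
    if not keep:
        keep = list(pts)                              # fall back to all features
    keep = np.array(keep, dtype=np.float32)
    return keep, keep.mean(axis=0)                    # features and nose point
```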
The user's skin color is the next feature to be learnt. This feature will help the tracking and gesture recognition by constraining the processing to the pixels classified inside the skin mask. In order to learn the skin color, the pixels inside the face region are used as color samples for building the learning set. A Gaussian model in the 3D RGB space is chosen to represent the skin color probability density function due to its good results in practical applications [1]. The values of the Gaussian model parameters (mean and covariance matrix) are computed from the sample set using standard maximum likelihood methods [4]. Once the model has been calculated, the probability of a new pixel being skin can be computed in order to create a "skin mask" of the user's face, see Fig. 3.
Fig. 3. Skin masks for different users
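A sketch of this skin model, assuming the pixels of the detected face rectangle as training samples; the density threshold used to binarize the mask is illustrative and not a value reported in the paper.

```python
import numpy as np

def fit_skin_model(face_pixels):
    """Maximum-likelihood Gaussian in RGB: face_pixels is an (N, 3) array."""
    mean = face_pixels.mean(axis=0)
    cov = np.cov(face_pixels, rowvar=False) + 1e-6 * np.eye(3)   # regularized
    return mean, np.linalg.inv(cov), np.linalg.det(cov)

def skin_mask(image, model, threshold=1e-7):
    """Evaluate the Gaussian density per pixel and threshold it (illustrative)."""
    mean, inv_cov, det = model
    diff = image.reshape(-1, 3).astype(np.float64) - mean
    maha = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)   # squared Mahalanobis
    density = np.exp(-0.5 * maha) / np.sqrt(((2 * np.pi) ** 3) * det)
    return (density > threshold).reshape(image.shape[:2])
```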
The last step of the Initialization phase is to build the user's eye models. Using the eyes and eyebrows region found in the face detection phase, both eyes can be located. First, the region is binarized to find the dark zones, and then we keep the bounding boxes of the pair of blobs that are symmetrical and located closest to the nose region. This way, the eyebrows or the face borders should not be selected. In Fig. 4,
an example of eye detection is shown. In the next section, eye tracking based on the eyes' color distribution is explained. This choice is justified by the fact that the eye color distribution, composed of the sclera and iris colors, differs from the color of the other facial features. In this way, our system can be used by users with light (blue or green) or dark (black or brown) eyes. The eye models are obtained through histogramming techniques in the RGB color space over the pixels belonging to the detected eye regions.
Fig. 4. Example of eye detection: the blobs that are selected (in red) are symmetrical and closest to the nose region.
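A possible sketch of this eye localization step; the binarization threshold, the minimum blob area and the scoring that prefers symmetric blobs near the nose are illustrative choices, not the paper's parameters.

```python
import cv2

def locate_eyes(gray, eyes_rect, nose_y, dark_thresh=60):
    """Binarize the eyes/eyebrows band, keep dark blobs, and pick the pair of
    bounding boxes that is roughly symmetric and closest to the nose region."""
    x, y, w, h = eyes_rect
    roi = gray[y:y + h, x:x + w]
    _, dark = cv2.threshold(roi, dark_thresh, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(dark, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 10]
    best, best_score = None, None
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            (x1, y1, w1, h1), (x2, y2, w2, h2) = boxes[i], boxes[j]
            symmetry = abs(y1 - y2) + abs(h1 - h2)            # similar row/height
            closeness = abs((y + y1 + h1) - nose_y) + abs((y + y2 + h2) - nose_y)
            score = symmetry + closeness                      # lower is better
            if best_score is None or score < best_score:
                best, best_score = (boxes[i], boxes[j]), score
    return best   # pair of (x, y, w, h) boxes in ROI coordinates, or None
```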
The model histograms are produced with the function b(x_i), which assigns the color at location x_i to the corresponding bin. In our experiments, the histograms are calculated in the RGB space using 16 x 16 x 16 bins. We increase the reliability of the color distribution by applying a weight function to each bin value depending on the distance between the pixel location and the eye center.
4 Facial Features Tracking
The facial feature tracking process consists of two tasks: eye tracking and nose tracking. As said before, eye tracking is based on the eyes' color distributions. Weighting the eye model with an isotropic kernel makes it possible to use a gradient-based optimization, such as the mean-shift algorithm, to search for each eye model in the new frame. Practical details and a discussion of the complete algorithm can be found in [6]. In our implementation this algorithm performs well and in real time. It is important to mention that small positional errors can occur; however, they are not critical because the eye tracking results are only used to define the image regions where the gesture recognition process is performed. Besides, to add robustness to this process, we only consider as search region those pixels belonging to the skin mask.
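A sketch of one tracking step with OpenCV's built-in mean-shift over a histogram back-projection, masked by the skin mask as described above; the eye histogram is assumed to come from cv2.calcHist over the detected eye patch, and the window size and termination criteria are illustrative.

```python
import cv2

def track_eye(frame_bgr, eye_hist, eye_window, skin_mask):
    """One mean-shift step: back-project the eye colour histogram and shift
    the previous eye window (x, y, w, h) towards the mode of that distribution.
    eye_hist is a float32 16x16x16 histogram computed with cv2.calcHist."""
    backproj = cv2.calcBackProject([frame_bgr], [0, 1, 2], eye_hist,
                                   [0, 256, 0, 256, 0, 256], scale=1)
    backproj[skin_mask == 0] = 0            # restrict the search to skin pixels
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
    _, new_window = cv2.meanShift(backproj, tuple(eye_window), criteria)
    return new_window
```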
The positional results that matter for our system are reported by the nose tracking algorithm, which uses the features selected in the Initialization process. In this case, the spatial intensity gradient information of the images is used to find the best image registration [2]. As mentioned before, for each frame the mean of all features is computed and defined as the nose position for that frame. The tracking algorithm is robust to rotation, scaling and shearing, so the user can move in a fairly unrestricted way. But, again, lighting changes or fast movements can cause the loss or displacement of the tracked features. As only the features beneath the nose region are in the region of interest, a feature is discarded when the distance between this feature and the mean position, i.e. the nose position, is greater than a predefined value.
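A sketch of this nose-tracking step using OpenCV's pyramidal Lucas-Kanade implementation; the window size, pyramid level and the distance threshold for discarding stray features are illustrative parameters, not the paper's values.

```python
import cv2
import numpy as np

def track_nose(prev_gray, gray, prev_pts, max_dist=20.0):
    """Track the selected nose features with pyramidal Lucas-Kanade, drop
    features that fall too far from the mean, and return the nose position."""
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, prev_pts, None,
        winSize=(15, 15), maxLevel=2,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 20, 0.03))
    pts = next_pts[status.ravel() == 1].reshape(-1, 2)
    if len(pts) == 0:
        return None, None                   # features lost: trigger re-initialization
    nose = pts.mean(axis=0)
    dists = np.linalg.norm(pts - nose, axis=1)
    pts = pts[dists < max_dist]             # discard stray features
    if len(pts) == 0:
        return None, None
    return pts.reshape(-1, 1, 2).astype(np.float32), pts.mean(axis=0)
```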
In theory, it would be possible to use Kalman filters for smoothing the positions. However, Kalman filters are not well suited to our case because they do not achieve good results with erratic movements such as face motion [7]. Therefore, our smoothing algorithm is based on the motion tendency of the nose positions (head motion). A linear regression method is applied to the nose positions tracked over a number of consecutive frames: the nose points of n consecutive frames are fitted to a line, and the nose motion is assumed to follow that line direction. To avoid discontinuities, the regression line is re-adjusted with every new point that arrives. Several frames of the tracking sequences are shown in Fig. 5.
Fig. 5. Facial feature tracking results
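A sketch of this smoothing idea: the last n nose positions are fitted with a line and the newest position is projected onto it. Here the line is obtained by a total-least-squares fit (which also handles near-vertical motion); the paper's exact regression formulation and the window size n are not specified, so both are assumptions.

```python
import numpy as np

def smooth_nose(history, n=10):
    """Fit a line to the last n nose positions and project the newest position
    onto it, so the cursor follows the recent motion tendency."""
    pts = np.asarray(history[-n:], dtype=np.float64)
    if len(pts) < 2:
        return pts[-1]
    mean = pts.mean(axis=0)
    # Principal direction of the recent trajectory (total-least-squares line).
    _, _, vt = np.linalg.svd(pts - mean)
    direction = vt[0]
    latest = pts[-1] - mean
    return mean + direction * np.dot(latest, direction)   # projection on the line
```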
5 Gesture Recognition
The gestures considered in this work are eye winks. Most existing works use high-quality images with good resolution in the eye zones. However, wink recognition in webcam-quality images is difficult. Besides, this process depends on the user's head position. Therefore, our wink detection process is based on a search for the iris contours. That is, if the iris contours are detected in the image, the eye is considered open; if not, the eye is considered closed. It is important to point out that this process is robust because it is only carried out in the eye regions tracked by the mean-shift procedure described before.
The process starts by detecting the vertical contours in the image. To avoid false positives, the vertical contours are combined by a logical operation with a mask generated by thresholding the original image. Finally, we keep the two longest vertical edges of each eye region, if present, as the iris candidates. If these two vertical edges, which correspond to the iris contours, do not appear for a number of consecutive frames (for gesture consistency), we assume that the eye is closed. The process for gesture recognition is illustrated in Fig. 6.
Fig. 6. Process for recognizing winks. The first row shows the process applied to open eyes; the second row shows the process over closed eyes. (a) Original image. (b) Vertical edges. (c) Iris contours.
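A sketch of this wink test, assuming vertical contours from a Sobel filter, a mask from a simple intensity threshold, and a "closed" decision after a few consecutive frames without iris edges; all numeric thresholds are illustrative.

```python
import cv2
import numpy as np

def iris_edges_present(eye_gray, edge_thresh=40, dark_thresh=70, min_len=4):
    """Return True if two sufficiently long vertical edges (the iris sides)
    are found inside the dark mask of the eye region."""
    sobel_x = cv2.Sobel(eye_gray, cv2.CV_64F, 1, 0, ksize=3)    # vertical edges
    edges = (np.abs(sobel_x) > edge_thresh).astype(np.uint8) * 255
    _, dark = cv2.threshold(eye_gray, dark_thresh, 255, cv2.THRESH_BINARY_INV)
    edges = cv2.bitwise_and(edges, dark)              # suppress false positives
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    heights = sorted((cv2.boundingRect(c)[3] for c in contours), reverse=True)
    return len(heights) >= 2 and heights[1] >= min_len   # two long vertical edges

def update_wink_state(closed_frames, eye_gray, n_consecutive=5):
    """Count consecutive frames without iris edges; report a wink after n frames."""
    closed_frames = 0 if iris_edges_present(eye_gray) else closed_frames + 1
    return closed_frames, closed_frames >= n_consecutive
```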
6 HeadDev
Using the techniques described in the previous sections, a functional perceptual interface has been implemented. This application aims to fulfill completely the functions of a standard mouse and to replace it by means of face feature tracking and facial gesture recognition.
A highlight of this system is its potential users. Since PUIs can help in e-Inclusion and e-Accessibility issues, the system can offer assistive technology for people with physical disabilities, helping them to lead more independent lives, while for any kind of audience it contributes to new and more powerful interaction experiences. Thus, its use is focused on users with physical limitations in the hands or arms, or with motion difficulties in the upper limbs, who cannot use a traditional mouse. Other uses serve entertainment and leisure purposes, such as computer games or exploring immersive 3D graphic worlds [5].
By means of the nose tracking process, HeadDev can simulate the mouse's motion. The precision required should be sufficient for moving the mouse cursor to the desired position. The mouse motion can be reproduced in two different ways: absolute and relative. In the absolute mode, the nose position would be mapped directly onto the screen, but this mode would require very accurate tracking, since a small tracking error in the image would be magnified on the screen. Therefore, we use relative motion for controlling the mouse's motion, which is not so sensitive to the tracking accuracy, since the cursor is controlled by the relative motion of the nose in the image. The relative mode yields smoother movements of the cursor, because the tracking error is not magnified. Then, if n_t = (x_t, y_t) is the new tracked nose position for frame t, the new mouse screen coordinates, s_t, are computed as
s_t = s_{t-1} + \alpha (n_t - n_{t-1}) ,        (1)
where α is a predefined constant that depends on the screen size and translates the
image coordinates to screen coordinates. The computed mouse screen coordinates are
sent to the system as real mouse inputs for placing the cursor in the desired position.
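A minimal sketch of Eq. (1); the gain alpha and the clamping of the result to the screen boundaries are configuration choices beyond what the equation itself specifies.

```python
def update_cursor(prev_cursor, prev_nose, nose, alpha, screen_size):
    """Relative mapping of Eq. (1): s_t = s_{t-1} + alpha * (n_t - n_{t-1}),
    with the result clamped to the screen (the clamping is our addition)."""
    sx = prev_cursor[0] + alpha * (nose[0] - prev_nose[0])
    sy = prev_cursor[1] + alpha * (nose[1] - prev_nose[1])
    width, height = screen_size
    return (min(max(sx, 0), width - 1), min(max(sy, 0), height - 1))
```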
Finally, to produce the mouse's right or left click events, we detect winks of the right or left eye, respectively, by means of the previously described gesture recognition process.
To evaluate the application's performance, HeadDev was tested by a set of 22 users, half of whom had never used the application before, while the other half had previously trained with the interface for a short period. A 5 x 5 point grid was presented on the computer screen and the user had to try to click on every point; each point had a radius of 15 pixels. While the user performed the test task, the distance between the mouse click position and the nearest point in the grid was stored in order to study the accuracy. The error distance is the distance in pixels of the faulty clicks (clicks that were not performed on the targets). The performance evaluation results are summarized in Table 1.
Table 1. Summary of the performance evaluation

Users Group    Recognized clicks    Mean distance of errors
Trained        97.3 %               2 pixels
Novel          85.9 %               5 pixels
The experiments have confirmed that continuous training of the users results in higher skills and, thus, better performance and accuracy when controlling the mouse position. Besides, it has to be taken into account that this test can produce some neck fatigue in some users; therefore, some of the errors when clicking on the point grid could be caused by this.
7 Conclusions and Future Work
In this paper we have proposed a new combination of several computer vision techniques, some of which have been improved and enhanced to reach more stability and robustness in tracking and gesture recognition. Numerical and visual results have been given.
In order to build reliable and robust perceptual user interfaces based on computer vision, certain practical constraints must be taken into account: the application must be robust enough to work in any environment and to use images from low-cost devices. In this paper we have presented a VBI system that satisfies these constraints. As a system application, we have presented an interface that is able to replace the standard mouse motions and events. Currently, the system has been tested by several disabled people (with cerebral palsy and physical disabilities) with encouraging results. Of course, more improvements have to be made, including more gestures (equivalent to BLISS commands or other kinds of language for disabled persons), sound (TTS and ARS) and adaptive learning capabilities for specific disabilities. Enhancements have been
planned as future work, such as including a larger set of head and face gestures in order to support current computer interactions such as web surfing.
HeadDev for Microsoft Windows is available under a freeware license on the Web page http://www.tagrv.com. This will allow the application to be tested by users around the world, and we will be able to improve the results by analyzing their reports. In the near future, a Linux version will also be available.
Acknowledgements
This work has been subsidized by the national project TIN2004-07926 of the Spanish Government (MCYT), and by TAGrv S.L., Fundación Vodafone and the Asociación de Integración de Discapacitados en Red. Javier Varona acknowledges the support of a Ramon y Cajal grant from the Spanish MEC.
References
1. Alexander, D.C., Buxton, B.F.: Statistical Modeling of Colour Data. International Journal
of Computer Vision 44 (2001) 87–109
2. Baker, S., Matthews, I.: Lucas-Kanade 20 Years On: A Unifying Framework. International
Journal of Computer Vision 56 (2004) 221–225
3. Betke, M., Gips, J., Fleming, P.: The Camera Mouse: Visual Tracking of Body Features to Provide Computer Access for People with Severe Disabilities. IEEE Transactions on Neural Systems and Rehabilitation Engineering 10 (2002)
4. Bishop, C.M.: Neural Networks for Pattern Recognition. Clarendon Press (1995)
5. Bradski, G.R.: Computer Vision Face Tracking as a Component of a Perceptual User Interface. In: Proceedings of the IEEE Workshop on Applications of Computer Vision (1998)
214–219
6. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based Object Tracking. IEEE Transactions
on Pattern Analysis and Machine Intelligence 25 (2003) 564–577
7. Fagiani, C., Betke, M., Gips, J.: Evaluation of Tracking Methods for Human-Computer Interaction. In: Proceedings of the IEEE Workshop on Applications of Computer Vision
(2002) 121–126
8. Fasel, B., Luettin, J.: Automatic Facial Expression Analysis: A Survey. Pattern Recognition 36 (2003) 259–275
9. Gorodnichy, D.O.: Towards automatic retrieval of blink-based lexicon for persons suffered
from brain-stem injury using video cameras. In: Proceedings of the IEEE Computer Vision and Pattern Recognition, Workshop on Face Processing in Video (2004)
10. Gorodnichy, D.O., Malik, S., Roth, G.: Nouse ‘Use Your Nose as a Mouse’ – a New
Technology for Hands-free Games and Interfaces. Image and Vision Computing 22 (2004)
931–942
11. Grauman, K., Betke, M., Gips, J., Bradski, G.: Communication via Eye Blinks Detection
and Duration Analysis in Real Time. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (2001)
12. Grauman, K., Betke, M., Lombardi, J., Gips, J., Bradski, G.R.: Communication via Eye Blinks and Eyebrow Raises: Video-Based Human-Computer Interfaces. Universal Access in the Information Society 2 (2003) 359–373
13. EyeTech Quick Glance, http://www.eyetechds.com/qglance2.htm (2006)
14. Kölsch, M., Turk, M.: Perceptual Interfaces. In: Medioni, G., Kang, S.B. (eds.): Emerging Topics in Computer Vision. Prentice Hall (2005)
15. Morimoto, C., Mimica, M.: Eye gaze tracking techniques for interactive applications.
Computer Vision and Image Understanding 98 (2005) 4–24
16. Shi, J., Tomasi, C.: Good Features to Track. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (1994) 593–600
17. Toyama, K.: "Look, Ma – No Hands!" Hands-Free Cursor Control with Real-Time 3D Face Tracking. In: Proceedings of the Workshop on Perceptual User Interfaces (1998) 49–54
18. Turk, M., Robertson, G.: Perceptual User Interfaces. Communications of the ACM 43
(2000) 32–34
19. Viola, P., Jones, M.: Robust Real-Time Face Detection. International Journal of Computer
Vision 57 (2004) 137–154
20. Zelinsky, A., Heinzmann, J.: Real-Time Visual Recognition of Facial Gestures for
Human-Computer Interaction. In: Proceedings of the IEEE Automatic Face and Gesture
Recognition (1996) 351–356