
Emotion Recognition from Human Eye Expression

S.R.Vinotha 1,a, R.Arun 2,b and T.Arun 3,c
1 Assistant Professor, Dhanalakshmi College of Engineering
2 Student, Dhanalakshmi College of Engineering
3 Student, Dhanalakshmi College of Engineering
a vinotharamaraj@gmail.com, b r.arun45@gmail.com, c arun.t1992@gmail.com
Abstract – Facial expressions play an essential role in social interaction and deliver rich information about human emotions. The most crucial feature of human interaction, and the one that grants naturalism to the process, is our ability to infer the emotional states of others. Our goal is to categorize the various human emotions from their eye expressions. The proposed system is a human emotion recognition system that analyzes the human eye region in video sequences. From the frames of the video stream, the human eyes are extracted using the well-known Canny edge operator and classified using a non-linear Support Vector Machine (SVM) classifier. Finally, we use a standard learning tool, the Hidden Markov Model (HMM), to recognize the emotions from the human eye expressions.
Keywords – Human emotions, Canny edge operator, Support Vector Machine (SVM), Hidden Markov Model (HMM)

I. INTRODUCTION
Human emotion recognition is an important component of efficient human-computer interaction. It plays a critical role in communication, allowing people to express themselves beyond the verbal domain. The analysis of emotions from human eye expressions involves the detection and categorization of various human emotions or states of mind. For example, in security and surveillance, an offender's or criminal's behaviour can be predicted by analysing images of their face taken from the frames of a video sequence.
The analysis of human emotions can be applied in a variety of application domains, such as video surveillance and human-computer interaction systems. In some cases, the results of such analysis can be used to identify and categorize the various human emotions automatically from the videos.
The six primary types of emotions are fear, joy, love, sadness, surprise and anger. Our method uses a feature extraction technique to extract the eyes, a support vector machine (SVM) classifier and an HMM to build a human emotion recognition system.
The remainder of the paper is organized as follows: Section 2 reviews related work, Section 3 presents the proposed methodology, Section 4 describes data collection, Section 5 reports experimental results, and Section 6 concludes the work.

II. RELATED WORK
In the last two decades, many approaches to human emotion recognition have been proposed. Kumar et al. [1] developed a frequency-domain feature of face images for recognition by proposing a cross-correlation method based upon the fast Fourier transform. Savvides et al. [2] further extended the correlation filter and developed a hybrid PCA-correlation filter, called "Corefaces", that performed robust illumination-tolerant face recognition.
Picard et al. [3] stressed the significance of human emotions and the analysis of affective states. Ekman et al. [4], [5] analyzed six fundamental facial expressions and encoded them into the so-called Facial Action Coding System (FACS); FACS enumerates the action units (AUs) of a face that cause facial movements. Kozma et al. [6] used 17 features; however, 5 of them are specific to their scenario, as they are calculated from eye movements over circles formed of images.
This survey reveals that many earlier approaches use more than two features to identify facial expressions or human emotions. Hence, in this paper we propose an efficient way of recognizing emotions using only the human eye expressions.
III. METHODOLOGY
The detailed components of the system are shown in Fig.1.
Fig.1. Functional flow diagram of the system
An image representing a set of frames is pre-processed and a noise-free image is obtained. The noise-free image is edge detected using the Canny edge operator. Using the feature extraction process, the eye regions are extracted from the resulting edge-detected image. The extracted eye regions are classified using the SVM classifier. Finally, the corresponding emotions are recognized and categorized using the HMM technique.

A. Pre-processing
The initial stage of the human emotion recognition system is the pre-processing of the image representing the set of frames.
Fig.2. Before and after pre-processing
Pre-processing images commonly involves removing low-frequency background noise, normalizing the intensity of the individual images, removing reflections, and masking portions of images. Image pre-processing is the technique of enhancing data images prior to computational processing. Examples include normalization, edge filters, soft focus, selective focus, user-specific filters, static/dynamic binarisation, image plane separation and binning.
The pre-processing techniques used here are as follows:
Filters – median filter: The median filter is a nonlinear digital filtering technique, often used to remove noise. Such noise reduction is a typical pre-processing step that improves the results of later processing. The main idea of the median filter is to run through the signal entry by entry, replacing each entry with the median of the neighbouring entries. The pattern of neighbours is called the "window", which slides, entry by entry, over the entire signal.
Smoothing: To smooth a data set is to create an approximating function that attempts to capture the important patterns in the data while leaving out noise and other fine-scale structures or rapid phenomena. In smoothing, the data points of a signal are modified so that individual points higher than their neighbours (presumably because of noise) are reduced, and points that are lower than the adjacent points are increased, leading to a smoother signal.
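As a concrete illustration of these two steps, the sketch below applies a median filter followed by Gaussian smoothing to one frame. The paper reports a MATLAB 7.0 implementation; this Python/OpenCV version and its window size of 5 are illustrative assumptions only, not the authors' code.

```python
import cv2

def preprocess_frame(frame_bgr, ksize=5):
    """Produce a noise-free grayscale image from one video frame (sketch)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # Median filter: replace each pixel by the median of its ksize x ksize window.
    denoised = cv2.medianBlur(gray, ksize)
    # Smoothing: a Gaussian blur attenuates the remaining fine-scale noise.
    return cv2.GaussianBlur(denoised, (ksize, ksize), 0)
```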
B. Edge Detection
The edge detection step finds the edges in the given image and returns a binary matrix in which edge pixels have the value 1 and all other pixels are 0. The input image is a non-sparse numeric array, while the output image is of class logical, i.e. a matrix containing only 0s and 1s. The output of edge detection is an edge image, or edge map, in which the value of each pixel reflects how strongly the corresponding pixel in the original image meets the requirements of being an edge pixel.
Fig.3. Edge detection
Here the face boundary is extracted using the well-known Canny edge operator. The Canny edge detection algorithm is widely regarded as an optimal edge detector. It is an edge detection operator that uses a multi-stage algorithm to detect a wide range of edges in images, as in Fig.3.
Stages of the Canny algorithm:
Noise reduction: Because the Canny edge detector is susceptible to noise present in raw, unprocessed image data, it uses a filter based on a Gaussian (bell curve): the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original that is not affected by a single noisy pixel to any significant degree.
Finding the intensity gradient of the image: An edge in an image may point in a variety of directions, so the Canny algorithm uses four filters to detect horizontal, vertical and diagonal edges in the blurred image. An edge detection operator (Roberts, Prewitt or Sobel, for example) returns a value for the first derivative in the horizontal direction (Gx) and the vertical direction (Gy). From these, the edge gradient magnitude and direction can be determined as G = √(Gx² + Gy²) and Θ = arctan(Gy/Gx).
Non-maximum suppression: Given estimates of the image gradients, a search is then carried out to determine whether the gradient magnitude assumes a local maximum in the gradient direction.
Tracing edges through the image and hysteresis thresholding: Thresholding with hysteresis requires two thresholds – high and low. We begin by applying the high threshold, which marks out the edges that are fairly certain to be genuine. Starting from these, and using the directional information derived earlier, edges can be traced through the image. While tracing an edge, we apply the lower threshold, allowing us to trace faint sections of edges as long as we have a starting point.
Once this process is complete we have a binary image where each pixel is marked as either an edge pixel or a non-edge pixel. As complementary output from the edge tracing step, the binary edge map obtained in this way can also be treated as a set of edge curves, which after further processing can be represented as polygons in the image domain.
Differential geometric formulation of the Canny edge detector: A more refined approach to obtaining edges with sub-pixel accuracy is differential edge detection, in which the requirement of non-maximum suppression is formulated in terms of second- and third-order derivatives computed from a scale space representation.
Variational-geometric formulation of the Haralick-Canny edge detector: A variational explanation for the main ingredient of the Canny edge detector, that is, finding the zero crossings of the second derivative along the gradient direction, was shown to be the result of minimizing a Kronrod-Minkowski functional while maximizing the integral over the alignment of the edge with the gradient field.
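The staged procedure above maps onto a single library call in most toolkits. A minimal sketch, assuming OpenCV and illustrative hysteresis thresholds of 100 and 200 (the paper does not state the values it used):

```python
import cv2

def detect_edges(preprocessed_gray, low=100, high=200):
    """Binary edge map of a pre-processed grayscale image (illustrative sketch)."""
    # cv2.Canny computes the intensity gradients (Sobel Gx, Gy), applies
    # non-maximum suppression, and traces edges with hysteresis thresholding;
    # the Gaussian noise-reduction step was already done during pre-processing.
    edges = cv2.Canny(preprocessed_gray, low, high)
    return edges > 0  # logical 0/1 edge map, as described in the text
```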
C. Feature extraction
In a recent paper, Moriyama et al. [7] demonstrated that precise and detailed detection and feature extraction from the eye region is already possible.
Extraction of eyes: The eyes, the most dominant and reliable features of the face, provide a constant channel of communication. Once the rough face region has been detected, as described above, an efficient feature-based method is applied sequentially to locate the rough regions of both eyes. Fig.4 shows the stages of the proposed method.
Fig.4. Detection of the eye regions, stages (a)-(g)
The first step is to calculate the gradient image (b) of the rough face region image (a). Then we apply a horizontal projection to this gradient image. Since the eyes lie in the upper part of the face and the pixels near the eyes vary more in value than those in the other parts of the face, the peak of this horizontal projection in the upper part gives the horizontal position of the eyes. According to this horizontal position and the total height of the face, we can easily line out a horizontal region (c) in which the eyes lie.
We then perform a vertical projection over all pixels in this horizontal region of image (c), and a peak of this projection is found near the vertical centre of the face image. In fact, the position of this vertical peak can be treated as the position of the vertical centre of the face (d), because the area between the two eyes is the brightest part of this horizontal region.
Fig.5. Eye extraction by the feature-based method
At the same time, a vertical projection is applied to the gradient image (b). There are two peaks of this projection near the right and left boundaries of the face image, which correspond to the right and left limits of the face (e).
In addition, from these two vertical limit lines, the width of the face can easily be estimated. Combining the results from (c), (d) and (e), we obtain an image segmented as in (f). Finally, based on the result of (f) and the estimated width of the face, the regions of both eyes can be lined out (g).
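The horizontal and vertical projections described above reduce to row and column sums over the relevant image. The rough sketch below shows that projection logic; the choice of which image each projection is taken over, the half-height search range and the band width are assumptions made for illustration, since the paper does not specify them.

```python
import numpy as np

def locate_eye_band(face_gray, face_gradient):
    """Estimate the eye row and the face centre column from projections (sketch)."""
    h, _ = face_gradient.shape
    # Horizontal projection of the gradient image: its peak in the upper part
    # of the face gives the vertical (row) position of the eyes, region (c).
    row_proj = face_gradient.sum(axis=1)
    eye_row = int(np.argmax(row_proj[: h // 2]))
    # Vertical projection of the pixels inside that horizontal band: the peak
    # lies in the bright area between the two eyes, i.e. the face centre (d).
    band = face_gray[max(0, eye_row - h // 10): eye_row + h // 10, :]
    centre_col = int(np.argmax(band.sum(axis=0)))
    return eye_row, centre_col
```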
D. Classification (SVM classifier)
A Support Vector Machine (SVM) is a maximal-margin hyperplane in a feature space built by using a kernel function. SVMs are a state-of-the-art classification technique for patterns that can be described by a finite set of characteristic features [8], with large application fields in text classification, face recognition, genomic classification, etc.
As a non-linear classifier the SVM handles overlapping classes effectively, it has achieved very good performance in many real-world classification problems, and it can deal with very high-dimensional feature vectors, which means that we can choose the feature vectors without restrictive dimension limits.
Multi-class SVMs are usually implemented by combining several two-class SVMs. In each binary SVM, only one class is labelled "1" and the others are labelled "-1". If there are M classes, the SVM constructs M binary classifiers by learning. During testing, each classifier produces a confidence coefficient, and the class with the maximum confidence coefficient is assigned to the test sample.
The SVM maps the input vectors to a higher-dimensional space in which a maximal separating hyperplane is constructed. Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data. The separating hyperplane is the hyperplane that maximizes the distance between these two parallel hyperplanes; the assumption is that the larger the margin, or distance between the parallel hyperplanes, the better the generalization error of the classifier will be [11]. We consider data points of the form
{(x1, y1), (x2, y2), (x3, y3), ..., (xn, yn)}
where yi = 1 or -1 is a constant denoting the class to which the point xi belongs, n is the number of samples, and each xi is a p-dimensional real vector. Scaling is important to guard against attributes with larger variance. We can view this training data by means of the dividing (or separating) hyperplane, which takes the form
w.x + b = 0
where b is a scalar and w is a p-dimensional vector. The vector w is perpendicular to the separating hyperplane. Adding the offset parameter b allows us to increase the margin; without b, the hyperplane is forced to pass through the origin, restricting the solution. As we are interested in the maximum margin, we are interested in the parallel hyperplanes, which can be described by the equations
w.x + b = 1
w.x + b = -1
Fig.6. Maximum-margin hyperplanes for an SVM trained with samples from two classes
If the training data are linearly separable, we can select these hyperplanes so that there are no points between them and then try to maximize their distance. By geometry, the distance between the two parallel hyperplanes is 2/|w|, so we want to minimize |w|. To keep data points out of the margin, we need to ensure that for all i either
w.xi + b ≥ 1 or w.xi + b ≤ -1
This can be written as
yi (w.xi + b) ≥ 1, 1 ≤ i ≤ n
Kernel selection of SVM: The training vectors xi are mapped into a higher (possibly infinite) dimensional space by a function Ф. The SVM then finds a linear separating hyperplane with the maximal margin in this higher-dimensional space, where C > 0 is the penalty parameter of the error term. Furthermore,
K(xi, xj) ≡ Ф(xi)T Ф(xj)
is called the kernel function.
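To make the classification stage concrete, the sketch below trains a multi-class, non-linear SVM over the extracted eye-region features using one binary classifier per class, as described above. scikit-learn is an assumed stand-in for the paper's MATLAB implementation, and the RBF kernel, C value and feature-file names are purely illustrative.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical training data: one row of eye-region features per frame,
# with emotion labels 1..6.
X_train = np.load("eye_features_train.npy")
y_train = np.load("eye_labels_train.npy")

scaler = StandardScaler().fit(X_train)   # scaling guards against high-variance attributes
# One binary SVM per class ("1" vs "-1"), with a non-linear RBF kernel and penalty C.
clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(scaler.transform(X_train), y_train)

def classify_eye_region(feature_vec):
    """Predict the emotion class of a single eye-region feature vector."""
    return int(clf.predict(scaler.transform(feature_vec.reshape(1, -1)))[0])
```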
E. Recognition (Hidden Markov Model)
Hidden Markov models have been widely used for many classification and modelling problems. One of their main advantages is the ability to model non-stationary signals or events: dynamic programming methods allow the signals to be aligned so as to account for this non-stationarity, and the HMM finds an implicit time warping in a probabilistic, parametric fashion. It uses the transition probabilities between the hidden states and learns the conditional probabilities of the observations given the state of the model. In the case of emotion expression, the signal consists of the measurements of the eye expression.
Emotion-specific HMMs: Since the display of a certain emotion in video is represented by a temporal sequence of facial motions, it is natural to model each eye expression using an HMM trained for that particular type of emotion. There are six such HMMs, one for each emotion: happy (1), angry (2), surprise (3), disgust (4), fear (5), sad (6). Several model structures can be used; the two main ones are the left-to-right model and the ergodic model.
In [9], Otsuka and Ohya used left-to-right models with three states to model each type of facial expression. The advantage of this model lies in the fact that it seems natural to model a sequential event with a model that starts from a fixed starting state and always reaches an end state. It also involves fewer parameters and is therefore easier to train. However, it reduces the degrees of freedom the model has to account for the observation sequence. On the other hand, using the ergodic HMM allows the model more freedom to account for the observation sequences, and in fact, for an infinite amount of training data it can be shown that the ergodic model reduces to the left-to-right model, if that is indeed the true model.
An HMM is given by the following set of parameters:
λ = (A, B, π)
where A is the state transition probability matrix, B is the observation probability distribution, and π is the initial state distribution. The number of states of the HMM is given by N. In the discrete case, B becomes a matrix of probability entries (a conditional probability table); in the continuous case, B is given by the parameters of the probability distribution function of the observations (normally chosen to be a Gaussian distribution or a mixture of Gaussians).
The observation vector Ot for the HMM represents the continuous motion of the facial action units. Therefore, B is represented by the probability density functions (pdfs) of the observation vector at time t given the state of the model. The Gaussian distribution is chosen to represent these pdfs, i.e.,
B = { bj(Ot) } ~ N(µj, ∑j), 1 ≤ j ≤ N
where µj and ∑j are the mean vector and full covariance matrix, respectively.
The parameters of each emotion-specific HMM are learned using the well-known Baum-Welch re-estimation formulas. For learning, hand-labelled sequences of each of the facial expressions are used as ground-truth sequences, and the Baum-Welch algorithm is used to derive the maximum likelihood (ML) estimate of the model parameters λ.
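A compact sketch of training one Gaussian-emission HMM per emotion with Baum-Welch re-estimation, using the hmmlearn package as an assumed substitute for the paper's implementation. The three-state models, the feature layout and the iteration count are illustrative assumptions; enforcing a strict left-to-right topology would require additional transition-matrix initialization not shown here.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

EMOTIONS = ["happy", "angry", "surprise", "disgust", "fear", "sad"]

def train_emotion_hmms(sequences_per_emotion, n_states=3):
    """Fit one Gaussian HMM per emotion via Baum-Welch (EM) re-estimation.

    sequences_per_emotion maps an emotion name to a list of (T_i, d) arrays of
    eye-expression feature vectors, one array per hand-labelled sequence.
    """
    models = {}
    for emotion in EMOTIONS:
        seqs = sequences_per_emotion[emotion]
        X = np.concatenate(seqs)             # stack all frames of this emotion
        lengths = [len(s) for s in seqs]     # sequence boundaries for hmmlearn
        hmm = GaussianHMM(n_components=n_states, covariance_type="full", n_iter=100)
        hmm.fit(X, lengths)                  # Baum-Welch / EM parameter estimation
        models[emotion] = hmm
    return models
```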
Parameter learning is followed by the construction of an ML classifier. Given an observation sequence Ot, where t ∈ (1, T), the probability of the observation given each of the six models, P(O | λj), is computed using the forward-backward procedure [10]. The sequence is classified as the emotion corresponding to the model that yields the highest probability, i.e.,
c* = argmax [ P(O | λc) ], 1 ≤ c ≤ 6
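The maximum-likelihood decision above then reduces to scoring a test sequence under each of the six trained models and taking the argmax. A minimal sketch, reusing the models dictionary from the previous sketch and again assuming hmmlearn, whose score method returns the log-likelihood computed with the forward algorithm:

```python
def recognize_emotion(models, observation_seq):
    """Pick the emotion whose HMM gives the test sequence the highest likelihood.

    observation_seq: (T, d) array of eye-expression features for one video clip.
    """
    # score() returns log P(O | λc); the argmax over the six models gives c*.
    log_likelihoods = {emotion: hmm.score(observation_seq)
                       for emotion, hmm in models.items()}
    return max(log_likelihoods, key=log_likelihoods.get)
```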
IV. DATA COLLECTION
In order to test the algorithms described in the previous sections, we collected data from people who were instructed to display facial expressions corresponding to the six types of emotions. Video was used as the input, and the sampling rate was 30 Hz.
The data were collected in an open recording scenario, in which each person was asked to display the expression corresponding to the emotion being induced. This is of course not the ideal way of collecting emotion data; the ideal way would be a hidden recording, inducing the emotion through events in the normal environment of the subject rather than in a studio.
V. EXPERIMENTAL RESULTS
The above algorithms were applied to various face images containing the frontal view of the human face, using Matlab 7.0. The images were obtained from a fixed CCTV camera.
Fig.7. Sample eye expressions: surprise, happy, fear, disgust, angry, sad
For emotion recognition, the set of frames representing a video is given as input. Initially, the various pre-processing techniques are used to remove the noise present in the image. The noise-free image is passed to the Canny edge detector, and the edge-detected image is obtained as output. From the edge-detected image, the eye regions are extracted using the feature-based method. The extracted image is then classified using the SVM classifier. Finally, the HMM votes for the model with the maximum probability, and that emotion label is produced as the output of the HMM, as in Fig.7.

VI. CONCLUSION
Recent research indicates that the understanding and recognition of emotional expressions plays an important role in the development and maintenance of social relationships. In this work, a novel and efficient framework for a human emotion recognition system is proposed. Compared with existing methods, the SVM is a robust and efficient classifier for labelling the emotions, and the HMM is used as the dynamic classifier to recognize the emotions because it achieves good accuracy. This approach is useful for real-world problems such as human-computer interaction, security surveillance and intelligent tutoring. In future, this work may be extended to identify the emotions of people wearing spectacles by using only their eye expressions.
REFERENCES
[1] B. V. Kumar, M. Savvides, K. Venkataramani, and X. Xie, "Spatial frequency domain image processing for biometric recognition," in Proc. IEEE Int. Conf. Image Processing, 2002, vol. 1, pp. 53-56.
[2] M. Savvides, B. Kumar, and P. Khosla, "Corefaces – Robust shift invariant PCA based correlation filter for illumination tolerant face recognition," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Jun. 2004, vol. 2, pp. 834-841.
[3] R. W. Picard, E. Vyzas, and J. Healey, "Toward machine emotional intelligence: Analysis of affective physiological states," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 10, pp. 1175-1191, Oct. 2001.
[4] P. Ekman and W. V. Friesen, Unmasking the Face. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[5] P. Ekman and W. V. Friesen, The Facial Action Coding System. San Francisco, CA: Consulting Psychologists Press, 1978.
[6] L. Kozma, A. Klami, and S. Kaski, "GaZIR: Gaze-based zooming interface for image retrieval," in Proc. 2009 Int. Conf. Multimodal Interfaces (ICMI-MLMI '09), New York: ACM, 2009, pp. 305-312.
[7] T. Moriyama, T. Kanade, J. Xiao, and J. F. Cohn, "Meticulously detailed eye region model and its application to analysis of facial images," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 5, pp. 738-752, May 2006.
[8] W. Hu and T. Tan, "A survey on visual surveillance of object motion and behaviors," IEEE Trans. Systems, Man, and Cybernetics, vol. 34, no. 3, Aug. 2004.
[9] T. Otsuka and J. Ohya, "Recognizing multiple persons' facial expressions using HMM based on automatic extraction of significant frames from image sequences," in Proc. Int. Conf. Image Processing (ICIP-97), Santa Barbara, CA, USA, Oct. 1997, pp. 546-549.
[10] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[11] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[13] K. D. Atwood, "Recognition of facial expressions of six emotions by children with specific language impairment," Brigham Young University, Aug. 2006.