
Emotion Recognition using Facial Expressions in Children using the NAO Robot

Alejandro Lopez-Rincon
Life Sciences and Health
Centrum Wiskunde & Informatica
Amsterdam, Netherlands
alejandro.lopez@iscpif.fr
Abstract—The detection of human emotions from facial expressions is crucial for social interaction. Several behavioral-computing systems in robotics therefore try to recognize human emotion from images and video, but most of them are trained to classify emotions in adults only. Using the standard set of six basic emotions (sadness, happiness, surprise, anger, disgust, and fear), we classify the facial expressions of children using the NAO robot. In this study, we compare the AFFDEX SDK against a Convolutional Neural Network (CNN) with Viola-Jones face detection, trained on the AffectNet dataset and tuned on the NIMH-ChEF dataset with transfer learning, for classifying facial expressions in children. We then test both systems by comparing their classification on the Child Affective Facial Expression (CAFE) dataset. Finally, we compare both systems using the NAO robot on a subset of the AM-FED and EmoReact datasets.
Index Terms—emotion recognition, human-robot interaction, CNN
I. INTRODUCTION
Human interaction with robots for social activities has increased in recent years, as both communication and hardware technologies have improved. Current systems are not just able to move, talk, or perform video analysis such as face recognition; there is also growing interest in using robots for the emotional assessment of people [8], [14], [37].
One of the most widely used robots for human interaction is the NAO robot, thanks to its design and programming capabilities. Several studies have used the NAO robot, for example to control its movement with the Microsoft Kinect sensor [2], [17], [30], to express the basic emotions [3], and for social interaction with the elderly [21] and with children with autism [28].
For human-robot interaction, we need to improve the perception of human emotion. Following Ekman's work [5], there are distinctive facial expressions for the basic emotions: anger, fear, sadness, disgust, enjoyment, and surprise, plus neutral. Thus, one approach to classifying human emotions is the Facial Action Coding System (FACS).
In this method, we detect the landmarks of the face as in Fig. 1, consider the movements of facial muscles, or Action Units (AUs), and map them into emotions. For example, happiness is defined by AUs 6 + 12 (Cheek Raiser + Lip Corner Puller) [6]. FACS-based emotion classification has been used in robotics [43], [44] and on mobile devices through portable systems such as the AFFDEX SDK [27].
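To make the AU-to-emotion mapping concrete, the sketch below shows how a FACS-style rule can be checked in code. Only the happiness rule (AU 6 + AU 12) comes from the text above; the surprise entry and the function name are illustrative assumptions, not the rules actually used by AFFDEX.

```python
# Minimal sketch of FACS-style emotion decoding from Action Units (AUs).
# Emotion -> set of AUs that must be active for the prototypical expression.
EMOTION_AUS = {
    "happiness": {6, 12},    # Cheek Raiser + Lip Corner Puller (from the text)
    "surprise":  {1, 2, 26}, # illustrative placeholder, not an AFFDEX rule
}

def decode_emotions(active_aus):
    """Return the emotions whose prototypical AU set is fully active."""
    active = set(active_aus)
    return [emotion for emotion, aus in EMOTION_AUS.items() if aus <= active]

print(decode_emotions([6, 12, 25]))  # -> ['happiness']
```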
AFFDEX is a system trained on 6.5 million facial expressions to recognize human emotion. It has been tested in several systems, as it is an out-of-the-box solution [11], [19], [25]. In particular, the AFFDEX SDK has been used in human-robot interaction, for example to improve the performance of a robot as a personal companion [24] and as a robot tutor for children [9]. More complex applications include assessing second-language reading skills in children [41] and serving as a tool to improve communication by reading the emotions of people with motor disabilities [13].
Fig. 1. Landmark facial detection using AFFDEX SDK in a photograph from
the NIMH-ChEF Dataset.
Another approach to classifying human emotions from visual analysis is the use of Convolutional Neural Networks (CNNs). CNNs are inspired by how the visual cortex works in animals, using overlapping windows for classification [15]. They are applied to many classification problems, such as medical data [18], [22], [36], audio [10], and images [12], [33], to mention only a few examples. In comparison to FACS-based systems, CNNs do not need to calculate the exact position of the AUs; given a training dataset and a proper architecture, they use the whole face to estimate the expression [7], [16].
There are several openly available datasets for human emotion classification, such as the Affectiva-MIT Facial Expression Dataset (AM-FED) [26], the Extended Cohn-Kanade dataset (CK+) [23], and AffectNet [29]. The AM-FED dataset consists of 242 facial videos recorded in real-world conditions, annotated with smile and
some FACS AU information. CK+ has 593 sequences from 123 subjects for the 6 basic emotions. The AffectNet dataset is made of around one million images collected from the internet in unconstrained conditions, roughly half labelled by humans and half by computer, for the 6 basic emotions plus contempt, uncertain, and no-face.
Most of the available datasets were created mainly with adults, so these studies do not focus on children. Nevertheless, some datasets do focus on the analysis of children's facial expressions, such as the Child Affective Facial Expression (CAFE) dataset [20], the NIMH Child Emotional Faces Picture Set (NIMH-ChEF) [4], and EmoReact [31]. The CAFE dataset consists of 1192 photographs of children aged 2-8 years for the 6 basic expressions and neutral, with human classification. EmoReact is an expression dataset of 1102 videos of children aged 4-14 years, annotated for whether Curiosity, Uncertainty, Excitement, Happiness, Surprise, Disgust, Fear, and Frustration are present in the video, together with a measure of valence. The NIMH-ChEF dataset has 534 photographs of fearful, angry, happy, sad, and neutral child facial expressions, with gaze directed at the camera or averted.
In this study, we compare a CNN with Viola-Jones face detection, trained on a subset of AffectNet and re-trained on the NIMH-ChEF dataset, against a FACS-based system using the AFFDEX SDK for recognizing human emotion from facial expressions. We compare both systems by measuring classification accuracy on the CAFE dataset. Then, we deploy the system on the NAO robot and use its video capture to detect smiles in 5 videos from the AM-FED dataset and to process reactions in 5 videos from the EmoReact dataset, simulating emotion recognition in real time.
In our overall system, we map these instructions to control the NAO robot with a controller, as shown in Fig. 2. Besides the movement instructions from the controller, we add emotion recognition and the corresponding reaction to the system. To avoid race conditions in the robot, we send the instructions to the robot as in Fig. 3. In summary, we have three main sub-systems: the movement and walking of the NAO robot, the emotion recognition using the NAO camera, and the reaction to the recognized emotion.
II. METHODOLOGY
We programmed a light version of the NAOqi library [35] for .NET to make it compatible with Visual Studio 2017, the AFFDEX SDK, and EMGUCV [39]. This light version includes basic motion of the arms, hands, and head, walking, text-to-speech (TTS), and video input of the NAO robot. The video input is configured to 320x240 color images with automatic head tracking. Then, for each frame, we run an analysis with either AFFDEX or the CNN.
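As a rough illustration of this per-frame loop (and not the actual .NET implementation), the sketch below uses the standard NAOqi Python SDK; the robot address and the classify_frame placeholder are hypothetical, standing in for either AFFDEX or the CNN.

```python
# Sketch of the per-frame analysis loop using the standard NAOqi Python SDK.
import numpy as np
from naoqi import ALProxy

NAO_IP, NAO_PORT = "192.168.1.10", 9559           # placeholder robot address

def classify_frame(frame):
    """Placeholder: plug in the AFFDEX SDK (system 1) or the CNN (system 2)."""
    return "neutral"

video = ALProxy("ALVideoDevice", NAO_IP, NAO_PORT)
# Resolution 1 = kQVGA (320x240), colorspace 13 = BGR, 10 fps.
client = video.subscribeCamera("emotion_client", 0, 1, 13, 10)
try:
    while True:
        image = video.getImageRemote(client)       # [width, height, ..., raw bytes, ...]
        if image is None:
            continue
        width, height, raw = image[0], image[1], image[6]
        frame = np.frombuffer(raw, dtype=np.uint8).reshape((height, width, 3))
        print(classify_frame(frame))
finally:
    video.unsubscribe(client)
```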
Fig. 2. Control of the NAO robot in the overall system.
A. NAOqi .NET lite version
As current versions of NAOqi are not Visual Studio compatible, or require specific versions of Python libraries to run, we developed a lite version of NAOqi to control the NAO robot. This NAOqi .NET lite version can interface with the AFFDEX SDK, Python 3.0, TensorFlow, Accord.NET [40], and EMGUCV under Visual Studio 2017 and Visual Studio 2015. We tested our framework on the NAO robot V4 and V5. For this purpose, we include instructions for moving the following joints: LShoulderPitch, RShoulderPitch, LShoulderRoll, RShoulderRoll, LElbowYaw, RElbowYaw, LHand, RHand, HeadPitch, and HeadYaw. In addition, we include instructions for the video frame call, TTS, and moveTo.
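For reference, a minimal sketch of these calls using the standard NAOqi Python SDK is shown below; the actual implementation is the .NET lite version described above, and the IP address and angle values here are placeholders.

```python
# Sketch of the motion, TTS, and moveTo calls, via the standard NAOqi Python SDK.
from naoqi import ALProxy

NAO_IP, NAO_PORT = "192.168.1.10", 9559           # placeholder robot address

motion = ALProxy("ALMotion", NAO_IP, NAO_PORT)
tts = ALProxy("ALTextToSpeech", NAO_IP, NAO_PORT)

motion.setStiffnesses("Body", 1.0)                # stiffen joints before moving
# Move a few of the joints listed above (angles in radians, last arg = speed fraction).
motion.setAngles(["LShoulderPitch", "RShoulderPitch", "HeadYaw"],
                 [1.0, 1.0, 0.3], 0.2)
motion.openHand("LHand")                          # hand control
tts.say("Hello")                                  # text-to-speech
motion.moveTo(0.2, 0.0, 0.0)                      # walk 20 cm forward
```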
Fig. 3. Clock Signal of the NAO robot control.
B. CNN Architecture with Viola-Jones
The CNN we use has a 4-layer architecture with hyperparameters chosen by Bayesian Optimization [38], trained on a subset of AffectNet. This subset consists of 283,901 human-labeled faces for the 6 basic emotions and neutral. The architecture has 4 layers ending in a Softmax,
as shown in Fig. 4. The activation units are Rectified Linear Units (ReLUs).
Fig. 6. Hyperparameter explanation for each of the CNN layers.
Fig. 4. CNN structure with 3 convolution layers and fully connected layer
with a Softmax for emotion recognition.
The input is a 100x100 greyscale image of a face detected with the Viola-Jones algorithm using Haar feature-based cascade classifiers [42] from the EMGUCV framework. The Viola-Jones algorithm is a machine learning technique based on detecting specific characteristics, such as edges, in an image. In a cascade classifier, the algorithm looks for these characteristics in sequence in order to detect an object. In the case of face detection, the algorithm has to be trained on many positive images (with faces) and negative images (without faces).
The EMGUCV library already includes pretrained face detection cascade classifiers. In a first stage we used the haar_cascade_default classifier, but in order to improve the performance and the detection of faces, we added the haar_cascade_alt2 classifier. Thus, our system works using both classifiers, as in Fig. 5; a sketch of this two-classifier detection is shown below. Finally, to improve face detection even further, we use the cascade classifier available in the AFFDEX SDK in addition to the haar_cascade_alt2 classifier from EMGUCV.
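A possible OpenCV equivalent of the two-classifier detection step (EMGUCV wraps OpenCV) is sketched here; the fallback order and detection parameters are assumptions, and the output is the 100x100 greyscale crop used as CNN input.

```python
# Sketch of two-stage Haar cascade face detection and the 100x100 grayscale crop.
import cv2

default_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
alt2_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_alt2.xml")

def detect_face(bgr_image):
    """Try the default cascade first, then fall back to the alt2 cascade."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    for cascade in (default_cascade, alt2_cascade):
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) > 0:
            x, y, w, h = faces[0]
            # 100x100 grayscale crop, the input size of the CNN.
            return cv2.resize(gray[y:y + h, x:x + w], (100, 100))
    return None
```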
The first three layers of the CNN have the form shown in Fig. 6, where i is the number of the layer, v the size of the convolution window, w the number of output channels, and h the max-pooling window size. There is one extra hyperparameter, w4, for the size of the fully connected layer. The training of the CNN is done with the TensorFlow library [1]. We chose these hyperparameters using Bayesian Optimization, keeping the best result after 65 candidate architectures evaluated on a 115,864-face subset of AffectNet. The hyperparameters are w1 = 255, w2 = 255, w3 = 255, w4 = 256, h1 = 4, h2 = 3, h3 = 2, v1 = 12, v2 = 3, v3 = 2.
From the 283,901 faces available in AffectNet, we tried different combinations of the cascade classifiers available in EMGUCV and in the AFFDEX SDK, dividing the dataset into 95% training and 5% validation. The best combination uses the cascade classifier available in the AFFDEX SDK. With it, we detect 267,864 faces out of 283,901. We then split this subset into 254,470 faces for training and 13,394 for validation, obtaining 68.4% accuracy on the validation set; for comparison, the accuracy reported on the AffectNet dataset using a ResNext neural network is 65%. The resulting confusion matrix for the validation set is given in Table I.
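A Keras-style sketch of this architecture, instantiated with the hyperparameters above, is given below; padding, strides, optimizer, and loss are not specified in the text, so those choices are assumptions.

```python
# Sketch of the 3-convolution + fully connected architecture of Fig. 4.
import tensorflow as tf

def build_cnn(num_classes=7):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(255, 12, activation="relu", padding="same",
                               input_shape=(100, 100, 1)),          # w1, v1
        tf.keras.layers.MaxPooling2D(4),                             # h1
        tf.keras.layers.Conv2D(255, 3, activation="relu", padding="same"),  # w2, v2
        tf.keras.layers.MaxPooling2D(3),                             # h2
        tf.keras.layers.Conv2D(255, 2, activation="relu", padding="same"),  # w3, v3
        tf.keras.layers.MaxPooling2D(2),                             # h3
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),               # w4
        tf.keras.layers.Dense(num_classes, activation="softmax"),    # 6 emotions + neutral
    ])

model = build_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```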
TABLE I
CONFUSION MATRIX OF THE VALIDATION SET (13,394 FACES). ROWS: ACTUAL LABEL; COLUMNS: PREDICTED LABEL.

           Neutral  Happy   Sad  Surprise  Fear  Disgust  Anger
Neutral       2797    431    45        12     0        0    172
Happy          705   5597    40        15     0        0     62
Sad            761    137   188         2     0        0     99
Surprise       382    130    16        83     3        0     22
Fear           161     36    21        41    15        0     26
Disgust         81     38     6         1     0        0     58
Anger          594     98    27        10     1        0    481
To improve the accuracy of our system on the CAFE dataset, we use the NIMH-ChEF database to fine-tune our pre-trained model. As the new dataset is small (534 faces) compared to the original (283,901 faces), we reduce the learning rate by a factor of 100 to avoid over-fitting, and retrain the system. This is known as weight tuning with transfer learning in Convolutional Neural Networks [32]. A summary of how the data is used is given in Fig. 7.
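A minimal sketch of this fine-tuning step is shown below; the saved-model file name, the base learning rate, the number of epochs, and the placeholder NIMH-ChEF arrays are assumptions used only for illustration.

```python
# Sketch of fine-tuning: reload the AffectNet-trained model and continue
# training on NIMH-ChEF with the learning rate divided by 100.
import numpy as np
import tensorflow as tf

base_lr = 1e-3                                            # assumed original learning rate
model = tf.keras.models.load_model("affectnet_cnn.h5")    # hypothetical saved model

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=base_lr / 100.0),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Placeholders standing in for the 534 NIMH-ChEF face crops and labels.
x_chef = np.zeros((534, 100, 100, 1), dtype=np.float32)
y_chef = np.zeros((534,), dtype=np.int32)
model.fit(x_chef, y_chef, epochs=10, batch_size=32)
```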
Fig. 5. CNN structure with 3 convolution layers and fully connected layer
with a Softmax for emotion recognition, and 2 input cascade classifiers.
Fig. 7. Summary of the used Datasets for training the CNN system.
To train the system, we assign a label to each of the basic emotions and neutral. A summary of the number of cases for each emotion in each dataset is given in Table II.
C. Emotion Recognition System
For our analysis, we will have two possible systems as
explained in Fig. 8. The system asks for a 320x240 frame
from the camera in the robot. System 1 will use the AFFDEX
SDK and system 2 will use the CNN to classify the emotions
from the captured frames. The frame rate for the CNN is 6.32
frames per second (fps) and 3.38 fps for AFFDEX, under the
TABLE II
SUMMARY OF CASES FOR EACH EMOTION IN EACH DATASET.

           CAFE   AffectNet   NIMH-ChEF
Neutral     230       70473         108
Happy       215      128064         104
Sad         108       23354         107
Surprise    103       13139           0
Fear        140        5861         104
Disgust     191        3573           0
Anger       205       23400         111
Total      1192      267864         534
TABLE IV
CONFUSION MATRIX FOR THE CNN SYSTEM IN THE CAFE DATASET. ROWS: ACTUAL LABEL; COLUMNS: PREDICTED LABEL.

Label              Neutral  Happy  Sad  Surprise  Fear  Disgust  Anger
0  Neutral             173     20    3        23     0        0     10
1  Happy                 2    196   12         2     0        0      3
2  Sad                  49     14   16         4     0        0     22
3  Surprise              4     25    4        53     5        0      8
4  Fear                 38     42    6        42     2        0      7
5  Disgust              12     50   22         0     0        0    105
6  Anger                 8     71   18         5     1        0     95
TABLE V
COMPARISON IN ACCURACY FOR THE CAFE DATASET. DF = DETECTED FACES IN THE TEST SET (CAFE), TRAIN SET (AFFECTNET), AND RETRAIN SET (NIMH-CHEF).

Classifier                              Accuracy Test    DF Test   DF Train   DF Retrain
CNN-Viola-Jones                         0.396812            1105     115632            -
CNN-2-Viola-Jones                       0.407718            1147     255346            -
CNN-2-Viola-Jones Re-Trained            0.425336            1147     255346          484
CNN-AFFDEX Viola Jones                  0.43959731543       1172     267864            -
CNN-AFFDEX Viola Jones Re-Trained       0.44882550335       1172     267864          534
AFFDEX SDK                              0.42449664429       1179          -            -

Fig. 8. Summary of the emotion recognition systems.
As far as we know, the only available classification study on the CAFE dataset is the one by Roa Barco [34]. In contrast to that study, we do not use the CAFE dataset as part of the training; we use it only as the test set.
III. TESTS
A. Test 1
In this test, we compare emotion recognition on the CAFE dataset. The dataset has 1,192 photos of children for the 6 basic emotions and neutral. We do not apply any preprocessing to the images. For AFFDEX, the reported emotion has to surpass a threshold value of 50 and be the maximum value among the emotions; in any other case, we consider the emotion to be neutral, as done in the SDK examples. The results are shown in the confusion matrix in Table III.
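The decision rule can be summarized by the following sketch; the score dictionaries are made-up examples of AFFDEX-style outputs on a 0-100 scale.

```python
# Sketch of the AFFDEX decision rule: report an emotion only if its score
# exceeds 50 and is the maximum; otherwise the frame is labeled neutral.
THRESHOLD = 50.0

def decide(scores):
    emotion, value = max(scores.items(), key=lambda kv: kv[1])
    return emotion if value > THRESHOLD else "neutral"

print(decide({"joy": 12.0, "surprise": 73.5, "anger": 8.1}))   # -> 'surprise'
print(decide({"joy": 31.0, "surprise": 22.0, "anger": 18.0}))  # -> 'neutral'
```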
TABLE III
CONFUSION MATRIX FOR THE AFFDEX SYSTEM IN THE CAFE DATASET. ROWS: ACTUAL LABEL; COLUMNS: PREDICTED LABEL.

           Neutral  Happy  Sad  Surprise  Fear  Disgust  Anger
Neutral        203      2    0        20     4        1      0
Happy           38    159    0        11     0        5      0
Sad             82      7    5         1     1        9      2
Surprise        37      1    0        59     0        1      0
Fear            42      5    0        90     2        0      0
Disgust         75     41    1         2     1       67      4
Anger           62     47    0         1     1       79     11
Fig. 9. NAO robot watching the screen; the analysis is done through the robot's visual capture.
For the CNN case, we compare the classification using different cascade classifiers for face detection: first the default classifier in EMGUCV, next the 2 available classifiers in EMGUCV, and then the cascade classifier in the AFFDEX SDK. Finally, we retrain the CNN using the NIMH-ChEF dataset. The confusion matrix for the best system is given in Table IV.
A summary of the results for the AFFDEX SDK and the CNN systems is shown in Table V. The table also reports the number of detected faces in the CAFE dataset.
B. Test 2
In this test, we place the robot at a distance from the monitor, as in Fig. 9. The robot processes every frame using either the AFFDEX SDK or the CNN. The analyses we report are obtained through the eyes of the robot, not from running the analysis directly on the computer, in order to evaluate the quality of the system on the robot itself and under natural lighting conditions. The objective is to simulate measuring reactions in real time using the NAO robot.
We use 5 videos from the AM-FED dataset in which the smile metric reaches a value close to 1.0. We then process each video with the AFFDEX SDK (Fig. 10) and the CNN (Fig. 11).
For the comparison, we consider joy intensity as a measure of happiness. Although AM-FED is not a children-based dataset, we analyze it because, as far as we know, it is the only openly available database that provides a frame-by-frame annotation of one of the six basic emotions.
Fig. 12. Comparison between the smile metrics from the AM-FED dataset
id 09e3a5a1-824e-4623-813b-b61af2a59f7c, AFFDEX SDK, and the CNN
measurements of joy.
Fig. 10. Example of a processed image from the NIMH-ChEF dataset using
AFFDEX SDK.
Fig. 13. Comparison between the smile metrics from the AM-FED dataset
id 2f88bbb8-51ef-42bf-82c7-e0f341df1d88, AFFDEX SDK, and the CNN
measurements of joy.
Fig. 11. Example of a processed image from the NIMH-ChEF dataset using
a CNN.
The comparison between the different systems on the 5 videos is shown in Figs. 12-16. To assess the accuracy of the methods, we compute the error between each system and the smile metric (Table VI). The error is given by

error = \frac{\sum (v_m - v_c)^2}{\#\text{frames}},   (1)

where v_m is the smile metric value and v_c is the calculated value for joy.
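A direct implementation of Eq. (1) is sketched below; the short sequences in the usage example are made up.

```python
# Sketch of Eq. (1): mean squared difference between the AM-FED smile metric
# and the joy value produced by AFFDEX or the CNN, frame by frame.
def frame_error(smile_metric, predicted_joy):
    """smile_metric and predicted_joy are per-frame sequences of equal length."""
    n = len(smile_metric)
    return sum((vm - vc) ** 2 for vm, vc in zip(smile_metric, predicted_joy)) / n

print(frame_error([0.0, 0.5, 1.0], [0.1, 0.4, 0.9]))  # -> 0.01
```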
C. Test 3
Fig. 14. Comparison between the smile metrics from the AM-FED dataset
id 3c06b2cd-62bb-4308-aefd-b79d9de748b7, AFFDEX SDK, and the CNN
measurements of joy.
In this test, we compare the classification of both systems on 5 videos from the EmoReact dataset.
Fig. 15. Comparison between the smile metrics from the AM-FED dataset
id 0d48e11a-2f87-4626-9c30-46a2e54ce58e, AFFDEX SDK, and the CNN
measurements of joy.
Fig. 17. Frame analysis of the EmoReact Dataset, from the NAO Robot using
AFFDEX.
Fig. 16. Comparison between the smile metrics from the AM-FED dataset
id 1e7bf94c-02a5-48de-92bd-b0234354dbd5, AFFDEX SDK, and the CNN
measurements of joy.
Fig. 18. Frame analysis of the EmoReact Dataset, from the NAO Robot using
CNN.
The EmoReact dataset is different from the AM-FED dataset, as it only gives a summary of the reactions in each video. The annotations in the EmoReact dataset are Curiosity, Uncertainty, Excitement, Happiness, Surprise, Disgust, Fear, Frustration, and Valence, with 1 or 0 for the emotions present and a percentage for Valence. From this set we consider only Happiness, Surprise, Disgust, Fear, and Frustration, as these are a subset of the basic emotions.
As in Test 2, we separate the robot from the screen and analyze the first 5 videos of the EmoReact test set frame by frame using AFFDEX and the CNN, as in Figs. 17 and 18. Then, we calculate the average of the 5 emotions over the whole video. Finally, we calculate the error between the given and the calculated values. Both are summarized in Table VII.
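The exact error formula for this test is not spelled out in the text; the values in Table VIII are consistent with a sum of absolute differences between the given 0/1 label and the per-video average of each of the five emotions, which is what the sketch below assumes.

```python
# Sketch of the Test 3 comparison: average each emotion over all frames of a
# video, then compare against the EmoReact 0/1 annotations (assumed L1 error).
def video_error(given, per_frame_scores):
    """given: dict emotion -> 0/1 label; per_frame_scores: dict emotion -> frame values."""
    error = 0.0
    for emotion, label in given.items():
        frames = per_frame_scores[emotion]
        average = sum(frames) / len(frames)
        error += abs(label - average)
    return error

given = {"happiness": 1, "surprise": 0, "disgust": 0, "fear": 0, "frustration": 0}
scores = {"happiness": [0.9, 1.0], "surprise": [0.1, 0.0],
          "disgust": [0.0, 0.0], "fear": [0.0, 0.0], "frustration": [0.02, 0.0]}
print(video_error(given, scores))  # -> 0.11
```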
TABLE VII
RESULTS FOR THE GIVEN VALUE AND THE MEASURED VALUE FOR HAPPINESS, SURPRISE, DISGUST, FEAR AND ANGER IN 5 VIDEOS OF THE EMOREACT DATASET.

                      Video 1     Video 2     Video 3     Video 4     Video 5
Happiness (given)           0           1           1           1           1
CNN Joy              0.622983    0.085684    0.669042           1    0.688294
Affdex Joy            0.45577    1.45E-05    0.162013    0.927167    0.436498
Surprise (given)            0           0           0           0           0
CNN Surprise         0.039917    0.123789    0.030722           0    0.114235
Affdex Surprise      0.003736    0.052033    0.003821    0.013467    0.039121
Disgust (given)             0           0           0           0           0
CNN Disgust          0.011083    0.016421    0.008528           0    0.021529
Affdex Disgust       0.006335    0.003793    0.002452    4.52E-05    0.000201
Fear (given)                0           0           0           0           0
CNN Fear             0.007167    0.055684    0.007125           0    0.007824
Affdex Fear          2.97E-05     1.7E-05    0.013329    4.58E-08    6.15E-05
Frustration (given)         0           0           0           0           0
CNN Anger            0.036983    0.076684    0.036903           0    0.013647
Affdex Anger          0.00028    0.000148    0.002232     4.8E-07    5.03E-05
TABLE VI
ERROR COMPARISON BETWEEN THE MEASUREMENT OF JOY OR HAPPINESS IN THE DIFFERENT SYSTEMS AND THE SMILE METRIC FROM THE AM-FED DATASET.

           AFFDEX        CNN
Video 1    0.002728   0.006761
Video 2    0.363869   0.180762
Video 3    0.030868   0.073629
Video 4    0.072483   0.219818
Video 5    0.279876   0.227063
Average    0.149965   0.141606
Finally, we calculate the error for both systems, for each
video (Table VIII).
TABLE VIII
ERROR COMPARISON BETWEEN AFFDEX AND THE CNN FOR 5 VIDEOS IN THE EMOREACT DATASET.

           Affdex        CNN
Video 1    0.466151   0.718133
Video 2    1.055977   1.186895
Video 3    0.859821   0.414236
Video 4    0.086346   0
Video 5    0.602936   0.468941
Average    0.614246   0.557641
IV. CONCLUSION AND FUTURE WORK
In this study, we compared a FACS-based system and a CNN with cascade classifiers for classifying the emotions of children in the CAFE dataset and with the NAO robot. From the results in Table V, the overall emotion classification accuracy of the CNN is better than that of the AFFDEX system, but the face detection of AFFDEX is slightly better. The best CNN classifies 535 out of 1192 faces correctly while detecting 1172 faces; in contrast, AFFDEX classifies 506 correctly and detects 1179 faces. Nevertheless, it is important to remember that the CNN-based system has room for improvement, as the number of faces used for training is only about 3% of that used for the FACS system. From the confusion matrices in Tables I and IV, we can see that the CNN system is not good at detecting disgust. One reason is that the AffectNet dataset has very few disgust examples in comparison to the other classes.
In Test 2 and Test 3, the CNN results are slightly better than those of AFFDEX, probably because the CNN analysis is almost twice as fast as AFFDEX (6.32 fps versus 3.38 fps). The difference in the average error is about 0.0084 (0.84%) for detecting joy in the AM-FED videos and 0.0566 (5.66%) for classifying reactions in EmoReact.
Both systems require the individual to face the camera frontally, but our .NET version of NAOqi makes it easy to add several cameras besides the one on the NAO, or other inputs such as audio. Thus, future analysis can rely on ensemble methods, using several cameras and inputs in real time.
For example, if we combine both systems for Test 1, the accuracy increases to 46.05% (549 faces correct). To combine them, we first run the AFFDEX SDK; if it reports disgust, that is the output, and in any other case we run the CNN system. The confusion matrix for the ensemble system is given in Table IX.
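This ensemble rule can be sketched as follows; affdex_classify and cnn_classify are hypothetical stand-ins for the two systems.

```python
# Sketch of the ensemble rule: keep the AFFDEX answer only when it is disgust,
# otherwise fall back to the CNN.
def ensemble_classify(frame, affdex_classify, cnn_classify):
    affdex_label = affdex_classify(frame)
    if affdex_label == "disgust":
        return affdex_label
    return cnn_classify(frame)
```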
TABLE IX
CONFUSION MATRIX FOR THE ENSEMBLE SYSTEM IN TEST 1. ROWS: ACTUAL LABEL; COLUMNS: PREDICTED LABEL.

           Neutral  Happy  Sad  Surprise  Fear  Disgust  Anger
Neutral        174     19    3        23     0        1     10
Happy            2    193   11         1     0        5      3
Sad             51     15   13         4     0        9     16
Surprise         7     25    4        54     4        1      8
Fear            40     42    6        43     2        0      7
Disgust         12     42   18         0     0       67     52
Anger            9     58    9         4     0       79     46
By combining both systems we see that the accuracy increases; this ensemble still needs to be tested on the NAO robot to see how the frame rate is affected. Thus, by improving the CNN system and using ensemble methods, we will be able to increase the emotion recognition performance even further in the future.
ACKNOWLEDGMENT
This work is part of project #15198, "Improving Childhood Cancer Care when Parents Cannot be There – Reducing Medical Traumatic Stress in Childhood Cancer Patients by Bonding with a Robot Companion", which is included in the research program Technology for Oncology, financed by the Netherlands Organization for Scientific Research (NWO), the Dutch Cancer Society (KWF), the TKI Life Sciences & Health, Asolutions, Brocacef, Cancer Health Coach, and Focal Meditech. The research consortium consists of the Centrum Wiskunde & Informatica, Delft University of Technology, the Academic Medical Center, and the Princess Maxima Center.
ANNEX
Code available at: https://github.com/steppenwolf0/emotioNAO
REFERENCES
[1] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: a system for large-scale machine learning. In OSDI (2016), vol. 16, pp. 265–283.
[2] Cheng, L., Sun, Q., Su, H., Cong, Y., and Zhao, S. Design and implementation of human-robot interactive demonstration system based on Kinect. In Control and Decision Conference (CCDC), 2012 24th Chinese (2012), IEEE, pp. 971–975.
[3] Cohen, I., Looije, R., and Neerincx, M. A. Child's recognition of emotions in robot's face and body. In Proceedings of the 6th International Conference on Human-Robot Interaction (2011), ACM, pp. 123–124.
[4] Egger, H. L., Pine, D. S., Nelson, E., Leibenluft, E., Ernst, M., Towbin, K. E., and Angold, A. The NIMH Child Emotional Faces Picture Set (NIMH-ChEFS): a new set of children's facial emotion stimuli. International Journal of Methods in Psychiatric Research 20, 3 (2011), 145–156.
[5] Ekman, P. An argument for basic emotions. Cognition & Emotion 6, 3-4 (1992), 169–200.
[6] Ekman, P., and Rosenberg, E. L. What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). Oxford University Press, USA, 1997.
[7] Fan, Y., Lu, X., Li, D., and Liu, Y. Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (2016), ACM, pp. 445–450.
[8] Ge, S. S., Samani, H. A., Ong, Y. H. J., and Hang, C. C. Active affective facial analysis for human-robot interaction. In Robot and Human Interactive Communication, 2008. RO-MAN 2008. The 17th IEEE International Symposium on (2008), IEEE, pp. 83–88.
[9] Gordon, G., Spaulding, S., Westlund, J. K., Lee, J. J., Plummer, L., Martinez, M., Das, M., and Breazeal, C. Affective personalization of a social robot tutor for children's second language skills. In AAAI (2016), pp. 3951–3957.
[10] Hershey, S., Chaudhuri, S., Ellis, D. P., Gemmeke, J. F., Jansen, A., Moore, R. C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., et al. CNN architectures for large-scale audio classification. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on (2017), IEEE, pp. 131–135.
[11] Jaques, N., McDuff, D., Kim, Y. L., and Picard, R. Understanding and predicting bonding in conversations using thin slices of facial expressions and body language. In International Conference on Intelligent Virtual Agents (2016), Springer, pp. 64–74.
[12] Jmour, N., Zayen, S., and Abdelkrim, A. Convolutional neural networks for image classification. In 2018 International Conference on Advanced Systems and Electric Technologies (IC ASET) (2018), IEEE, pp. 397–402.
[13] Kashii, A., Takashio, K., and Tokuda, H. Ex-Amp Robot: Expressive robotic avatar with multimodal emotion detection to enhance communication of users with motor disabilities. In Robot and Human Interactive Communication (RO-MAN), 2017 26th IEEE International Symposium on (2017), IEEE, pp. 864–870.
[14] Kollar, T., Tellex, S., Roy, D., and Roy, N. Toward understanding natural language directions. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction (2010), IEEE Press, pp. 259–266.
[15] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.
[16] Levi, G., and Hassner, T. Emotion recognition in the wild via convolutional neural networks and mapped binary patterns. In Proceedings of the 2015 ACM International Conference on Multimodal Interaction (2015), ACM, pp. 503–510.
[17] Li, C., Yang, C., Liang, P., Cangelosi, A., and Wan, J. Development of Kinect based teleoperation of NAO robot. In 2016 International Conference on Advanced Robotics and Mechatronics (ICARM) (2016), IEEE, pp. 133–138.
[18] Li, Q., Cai, W., Wang, X., Zhou, Y., Feng, D. D., and Chen, M. Medical image classification with convolutional neural network. In Control Automation Robotics & Vision (ICARCV), 2014 13th International Conference on (2014), IEEE, pp. 844–848.
[19] Liu, R., Salisbury, J. P., Vahabzadeh, A., and Sahin, N. T. Feasibility of an autism-focused augmented reality smartglasses system for social communication and behavioral coaching. Frontiers in Pediatrics 5 (2017), 145.
[20] LoBue, V., and Thrasher, C. The Child Affective Facial Expression (CAFE) set: validity and reliability from untrained adults. Frontiers in Psychology 5 (2015), 1532.
[21] López Recio, D., Márquez Segura, E., Márquez Segura, L., and Waern, A. The NAO models for the elderly. In Proceedings of the 8th ACM/IEEE International Conference on Human-Robot Interaction (2013), IEEE Press, pp. 187–188.
[22] Lopez-Rincon, A., Tonda, A., Elati, M., Schwander, O., Piwowarski, B., and Gallinari, P. Evolutionary optimization of convolutional neural networks for cancer miRNA biomarkers classification. Applied Soft Computing 65 (2018), 91–100.
[23] Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., Ambadar, Z., and Matthews, I. The Extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on (2010), IEEE, pp. 94–101.
[24] Lui, J. H., Samani, H., and Tien, K.-Y. An affective mood booster robot based on emotional processing unit. In Automatic Control Conference (CACS), 2017 International (2017), IEEE, pp. 1–6.
[25] McDuff, D. Discovering facial expressions for states of amused, persuaded, informed, sentimental and inspired. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (2016), ACM, pp. 71–75.
[26] McDuff, D., Kaliouby, R., Senechal, T., Amr, M., Cohn, J., and Picard, R. Affectiva-MIT Facial Expression Dataset (AM-FED): Naturalistic and spontaneous facial expressions collected. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2013), pp. 881–888.
[27] McDuff, D., Mahmoud, A., Mavadati, M., Amr, M., Turcot, J., and Kaliouby, R. E. AFFDEX SDK: a cross-platform real-time multi-face expression recognition toolkit. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems (2016), ACM, pp. 3723–3726.
[28] Miskam, M. A., Shamsuddin, S., Samat, M. R. A., Yussof, H., Ainudin, H. A., and Omar, A. R. Humanoid robot NAO as a teaching tool of emotion recognition for children with autism using the Android app. In 2014 International Symposium on Micro-NanoMechatronics and Human Science (MHS) (2014), IEEE, pp. 1–5.
[29] Mollahosseini, A., Hasani, B., and Mahoor, M. H. AffectNet: A database for facial expression, valence, and arousal computing in the wild. arXiv preprint arXiv:1708.03985 (2017).
[30] Mukherjee, S., Paramkusam, D., and Dwivedy, S. K. Inverse kinematics of a NAO humanoid robot using Kinect to track and imitate human motion. In Robotics, Automation, Control and Embedded Systems (RACE), 2015 International Conference on (2015), IEEE, pp. 1–7.
[31] Nojavanasghari, B., Baltrušaitis, T., Hughes, C. E., and Morency, L.-P. EmoReact: a multimodal approach and dataset for recognizing emotional responses in children. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (2016), ACM, pp. 137–144.
[32] Oquab, M., Bottou, L., Laptev, I., and Sivic, J. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 1717–1724.
[33] Rawat, W., and Wang, Z. Deep convolutional neural networks for image classification: A comprehensive review. Neural Computation 29, 9 (2017), 2352–2449.
[34] Roa Barco, L. Analysis of facial expressions in children: Experiments based on the DB Child Affective Facial Expression (CAFE).
[35] Robotics, A. NAOqi framework. http://doc.aldebaran.com/1-14/dev/naoqi/index.html.
[36] Sahiner, B., Chan, H.-P., Petrick, N., Wei, D., Helvie, M. A., Adler, D. D., and Goodsitt, M. M. Classification of mass and normal breast tissue: a convolution neural network classifier with spatial domain and texture images. IEEE Transactions on Medical Imaging 15, 5 (1996), 598–610.
[37] Schaaff, K., and Schultz, T. Towards an EEG-based emotion recognizer for humanoid robots. In Robot and Human Interactive Communication, 2009. RO-MAN 2009. The 18th IEEE International Symposium on (2009), IEEE, pp. 792–796.
[38] Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE 104, 1 (2016), 148–175.
[39] Shi, S. Emgu CV Essentials. Packt Publishing Ltd, 2013.
[40] Souza, C. R. The Accord.NET framework. São Carlos, Brazil: http://accord-framework.net (2014).
[41] Spaulding, S., Gordon, G., and Breazeal, C. Affect-aware student models for robot tutors. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems (2016), International Foundation for Autonomous Agents and Multiagent Systems, pp. 864–872.
[42] Viola, P., and Jones, M. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on (2001), vol. 1, IEEE, pp. I–I.
[43] Zhang, L., Jiang, M., Farid, D., and Hossain, M. A. Intelligent facial emotion recognition and semantic-based topic detection for a humanoid robot. Expert Systems with Applications 40, 13 (2013), 5160–5168.
[44] Zhang, L., Mistry, K., Jiang, M., Neoh, S. C., and Hossain, M. A. Adaptive facial point detection and emotion recognition for a humanoid robot. Computer Vision and Image Understanding 140 (2015), 93–114.