Chapter 4
EMOTIONAL DATABASE
4.1 Difficulties in acquiring emotional data
The first problem faced when investigating the emotions contained in speech is choosing a valid database, which becomes the basis of the subsequent research work. Unfortunately, the scarcity of available emotional databases makes the recording stage an almost unavoidable task within the process. The performance of an emotion recognizer can differ clearly depending on which kind of speech is used. Generally, three different categories of emotional speech are considered:

• Spontaneous speech
• Acted speech
• Elicited speech
All three groups have advantages and drawbacks, and none of them can be singled out as generally optimal. The selection of the database depends strongly on the application in which the emotion recognizer is going to be employed. The categories that are relevant for establishing the correspondences between emotions and speech also depend to a certain extent on the task, i.e. different applications may benefit from different categorizations. In the framework of this work, the target scenario is the Sony entertainment robot AIBO. To that end, an emotional database has been recorded, simulating different possible situations, which comprises all the desired emotions. From the application point of view it is of interest to have five different emotional states: angry, happy, sad, bored and neutral.
Another problem that has to be accounted for when a database is chosen is the cultural dependency found in the way emotions are expressed. Many studies try to find out the extent to which emotional expression is psycho-biologically or culturally determined. For example, Scherer [Sch00] explored the existence of a universal psycho-biological mechanism of emotion in speech across languages and cultures by studying the recognition of five emotions in nine languages, obtaining 66% accuracy. In [Abe01], recordings of a Swedish speaker uttering a phrase while expressing different emotions were interpreted by listeners with different native languages: Swedish, English, Finnish and Spanish. The results show that the native listeners were the most successful in recognizing the emotions appropriately. Nevertheless, another study [Tic00], comparing the cross-cultural decoding of emotions between Japanese and English subjects, suggests that the vocal effects of possibly quasi-universal psycho-biological response mechanisms may be present.
The following sub-sections summarize the benefits and drawbacks of the three main kinds of speech.
4.1.1 Spontaneous speech.
Spontaneous speech is often argued to contain the most direct and authentic emotions, but the difficulties in collecting this kind of speech are also considerable. Ideally, speakers should be recorded without knowing about it, so that they behave completely naturally, but this kind of data collection raises difficulties, since such a procedure is ethically problematic (see [Cam00, Cam01]). Spontaneous speech can also cause legal copyright problems. Although this kind of data is difficult to collect, corpora of spontaneous speech do exist, mainly consisting of clips from different television programs, but with significant distribution limitations.
Another weakness of this kind of speech is that the data in the corpus must be categorized. Emotional categories are quite fuzzy in their definitions, and different researchers use different sets. Systematic and careful evaluations of the tagsets used for labeling emotions are generally lacking, and the labeling process becomes hard and expensive.
Examples of available natural databases are the Belfast database, which contains audiovisual recordings of 100 English speakers exhibiting relatively spontaneous emotion and is used e.g. in [Scö01, Cow00]; the Leeds-Reading Emotion in Speech Corpus, e.g. [Gre95]; the JST database, e.g. [Cam01]; and the SUSAS corpus [Han99], which consists of conversations of air force pilots and therefore covers situations still less common than many everyday ones. Aviation data, i.e. crew conversations in cases where the aircraft is crashing, has also been used (e.g. by [Bre83] or [Wil69]), as well as the radio recordings of the reporting of the Hindenburg catastrophe (used e.g. by [Wil69]). There are also other studies using spontaneous speech, but all of them have faced ethical criticism.
4.1.2 Acted speech.
Given the difficulty of inducing or observing naturally occurring vocal expressions of emotion, most researchers in this area have used actors as subjects, asking them to vocally portray different emotions, and have analyzed the acoustic features of the recorded portrayals.
Acted speech does not present the same ethical problems that arise when collecting spontaneous speech; however, its degree of naturalness is often questioned. Acted speech can be recorded from different sources: sometimes professional actors are employed (see [Ban96]), in other cases non-professional actors, drama students or even other students are asked to utter the emotional corpus. Naturally, the quality of the acting can be expected to differ between recordings, and these differences have to be taken into account as well.
In the first place, the quality of acted speech is a function of the quality of the acting performed, which might affect the manifestations of the emotions. But there are further uncertainties in using acted speech; the most important one is whether acted speech can really be said to reflect authentic emotions. Some authors [Gus01] argue that, due to the exaggerated nature of acted speech, it is not possible to generalize from acted emotional speech to natural speech, even though high recognition rates can often be obtained in such experiments. Obviously there is an inverse relation between naturalness and ease of acquisition. Acted speech is an indication of how people believe that emotions should be expressed in speech, not of how emotions are actually expressed
[Sti01]. This indicates that acted speech is more stereotypical, and that the expression of emotions is more extreme than in spontaneous speech. For a speech synthesis application this might not be a problem; it may rather be an advantage to use stereotypical emotional expressions. Producing the most prototypical and easily interpretable emotive correlates, instead of real ones, can even be profitable in synthesized speech. These stereotypes could be universally understood in spite of their lack of spontaneity. In speech recognition, on the contrary, this mismatch between the ideal and reality gives rise to problems. Since there is no unanimous way to express emotions, because expression strongly depends on many factors such as the social environment or the speaker's personality, automatic recognition systems should be capable of interpreting a wide range of variations in emotional expression. In other words, in recognizing speech we have to cope with the complexity of reality.
4.1.3 Elicited speech.
The basis of elicited speech resides in emotion induction. One of the major requirements for the empirical study of the effects of the speaker's emotional state on acoustic voice parameters is the ability to induce affective and attitudinal states in a reliable and realistic fashion. Several techniques have been developed in the literature to induce affective states in a controlled way. These range from the reading of positive or negative self-statements, through the use of music and the presentation of films, to the threat of having to speak in public. For instance, subjects watch a film that should evoke specific emotions, and then they have to retell the film to the experimenter. The idea here is that the speech will be colored by the induced emotion. It is also possible to put a subject into a situation meant to evoke a specific emotion, and then record his speech. However, this method suffers from ethical problems, i.e. it is not fully ethical to scare someone and then record his speech. In [Gus01] it is questioned whether doing this is really more unethical than simply recording someone who is already scared. As a result of this problem, the induced or elicited emotions are often too mild, as if there were an inverse relation between the strength of the induction and its ethical cost.
Various techniques using mental imagery have been used effectively to induce affective states in which physiological, vocal and facial reactions congruent with the target states could be elicited for a range of emotions and attitudinal states. Finally, within
the fields of speech science and human factors, interactive tasks and games on computers have been used to induce states of high cognitive load and stress, as well as a number of emotional states. This technique seems particularly relevant to research involving automatic and computer-controlled speech interfaces. Wizard of Oz (WoZ) techniques are also widely employed: a realistic situation is presented to the subject and his emotional reactions are captured. This technique is used in the present research through the scenario "one day with AIBO".
A wide range of procedures has been attempted to provoke emotions in an artificial way. The induction method has the positive feature that it gives control over the stimulus; on the other hand, different subjects may react differently to the same stimulus. The validity of such elicited, or induced, emotional speech depends to a large extent on how successful the induction process is.
Studies that have used induced emotional speech include e.g. [Ski35], [Fri62], [Hec68] and [Iid98].
4.2 Framework
The research carried out in this thesis is oriented to the AIBO entertainment robot, developed by Sony, which has the capability to communicate with the world around it through the senses of sight, sound and touch. In order to obtain relevant results, it is desirable to have a speech database as close as possible to spontaneous emotional speech in the target scenario.
With that purpose in mind, different stories in the context "one day with AIBO" have been designed, taking into account that approximately 30 commands in five emotions (angry, happy, sad, bored and neutral) should be included. These stories were recorded by a professional speaker, with the aim of introducing subsequent speakers to the intended emotion.
Recordings of the database thus focus on the commands to which AIBO usually attends. Before further details of the database are given, one observation must be made: in order to obtain enough data for the speaker dependent experiments, two subjects, one male and one female, have been selected from the database, and a larger amount of data has been recorded from them:
• Speaker A: one male native German speaker. The AIBO stories are recorded twice, corresponding to the speaker ids id0013 and id0014 (see table 4.1). The AIBO commands are recorded twice.
• Speaker B: one female non-native English speaker. The AIBO stories are recorded twice, corresponding to the speaker ids id0029 and id0030 (see table 4.1). The AIBO commands are recorded twice.
4.3 Recording sessions
The collection of the database was completed in the recording studio of the Advanced Technology Centre of Stuttgart (ATCS), property of Sony International Europe GmbH. The software used in the recording process was implemented by Sony at the same location, i.e. ATCS, and is called Speech Recording System Program V. 3.0.0.4.
Recordings were made with two different microphones. A Sony C38B high-quality microphone was situated close to the speaker and forms the left channel. In addition, a Sony WM4108B microphone was placed 30 cm in front of the speaker as a far-field microphone, and its signal was set on the right channel. Both channels were recorded at a sampling frequency of 48 kHz and subsequently converted to 16 kHz. The present work only considers the close-talk input, whereas the far-field signal is kept for further research.
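For illustration only, the following minimal sketch shows how such a 48 kHz stereo recording could be reduced to the 16 kHz close-talk signal used here; the assumption that the recordings are stored as stereo PCM WAV files, as well as the file names, are hypothetical and not part of the original recording software.

    # Minimal sketch (hypothetical file layout): convert a 48 kHz stereo
    # recording into the 16 kHz close-talk signal described above.
    from scipy.io import wavfile
    from scipy.signal import resample_poly
    import numpy as np

    def extract_close_talk(in_path, out_path):
        rate, data = wavfile.read(in_path)         # expected: 48 kHz, 2-channel PCM
        assert rate == 48000 and data.ndim == 2
        close = data[:, 0].astype(np.float64)      # left channel = close microphone
        close_16k = resample_poly(close, up=1, down=3)   # 48 kHz -> 16 kHz
        wavfile.write(out_path, 16000, close_16k.astype(np.int16))

    # Hypothetical file names, for illustration only.
    extract_close_talk("id0013_raw_48k.wav", "id0013_close_16k.wav")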
As previously introduced in section 4.2, two different kinds of recordings are performed:
• AIBO commands is a dataset of read speech consisting of the AIBO commands read one after another in each of the five emotional states considered for this work (anger, boredom, happiness, neutrality and sadness). For the AIBO commands data acquisition, utterances are recorded as read speech and therefore no story is performed; only the commands are prompted, and the speakers are simply asked to utter them with a certain emotional content. Since these commands were recorded, in a first step, in order to increase the amount of data for the speaker dependent experiments, recordings of this nature only exist for speakers A and B. A database of only neutral commands uttered by 7 male and 6 female speakers is also used for the purposes of experiment 8.3.2.1, whose findings question the absence of emotional content in the neutral utterances resulting from the AIBO stories. This fact comes from the intrinsic emotional meaning of the commands, e.g. "Let's play" has a propensity to be uttered happily and "Be quiet!" suggests angry intentions.
• One day with AIBO is a database containing emotional samples obtained as elicited (WoZ) speech. People are put into an emotional state by some context action and then asked to read the commands. Subjects are asked to sit in front of a screen and to listen to a recording through headphones. This recording, designed to supply the emotional context, was previously produced by a professional speaker. While listening to the story, the subjects can also read it on the screen. When they are required to utter a command, it is prompted on the screen. The emotional content with which this command should be uttered is unequivocally given by the story context; in addition, an icon is presented on the screen next to the sentence as an unambiguous confirmation.
Speech files are automatically labelled with the corresponding emotion during the recording session. The story was designed taking into account that at least the 26 AIBO commands uttered in the 5 different emotional states should be included. The speaker follows all the situations until the end of the story, which is, to add some non-technical information, a happy ending. The labelling of the database is made according to the emotion that is supposed to be uttered in each situation. That means that this work deals with intended emotional expression, without re-labelling through listening tests. This position defends the idea that emotions should be recognised from the natural expression of the speakers, instead of restricting the study to "exaggerated" ways of emotional expression. Nevertheless, it would be interesting to contrast the results with an appropriately labelled database, which is proposed as further work.
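As a purely illustrative sketch of how such automatic labels can be exploited afterwards, the snippet below assumes a hypothetical naming convention in which each file name encodes the speaker id and the intended emotion (e.g. id0013_angry_017.wav); the actual convention used by the recording software is not specified in this chapter.

    # Hypothetical labelling sketch: the naming convention
    # "<speaker>_<emotion>_<index>.wav" is an assumption, not documented here.
    import os
    from collections import Counter

    EMOTIONS = {"angry", "bored", "happy", "neutral", "sad"}

    def label_from_filename(filename):
        # e.g. "id0013_angry_017.wav" -> ("id0013", "angry")
        speaker, emotion, _ = os.path.splitext(filename)[0].split("_")
        if emotion not in EMOTIONS:
            raise ValueError("unknown emotion label: " + emotion)
        return speaker, emotion

    def emotion_counts(directory):
        # Count utterances per intended emotion in one speaker's directory.
        counts = Counter()
        for name in os.listdir(directory):
            if name.endswith(".wav"):
                counts[label_from_filename(name)[1]] += 1
        return counts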
Since the recording sessions have taken place in parallel with the development of the thesis, the amount of data has increased successively. The following tables reflect the data available at the closing stage of this work.
4.3.1 One day with AIBO
Following the procedure described above, through which the subjects are put into an emotional context, 30 speakers were recorded. Information about the speakers is given in table 4.1.
The labels "good emotional performer" and "one of the best emotional performers" result from the judgement of the recording staff, who attended all the sessions. However, it must be noted that this selection is based exclusively on the general performance of the speakers at recording time and not on later listening tests. The use of different sets of speakers to carry out the experimental enquiries is detailed in chapters 8 and 9.
SPEAKER ID  SEX     AMOUNT OF DATA (angry/bored/happy/neutral/sad)  COMMENTS
id0001      Male    40/31/33/34/30   Used for speaker dependent experiments in 8.1.1
id0002      Male    40/31/33/34/30   Selected as one of the best emotional performers
id0003      Male    40/31/33/34/30   -
id0004      Male    40/31/33/34/30   Selected as a good emotional performer
id0005      Female  39/34/33/34/30   Selected as one of the best emotional performers
id0006      Male    40/31/33/34/30   Used to test the speaker independent case in 8.1.2
id0007      Female  40/31/33/34/30   -
id0008      Male    40/31/33/34/30   Used for speaker dependent experiments in 8.1.1
id0009      Female  40/31/33/34/30   Selected as a good emotional performer
id0010      Male    40/31/33/34/30   Discarded because of bad recording conditions
id0011      Female  40/31/33/34/30   -
id0012      Female  40/31/33/34/30   -
id0013      Male    40/31/32/34/30   Speaker A
id0014      Male    40/30/33/34/30   Speaker A
id0015      Male    40/31/33/34/30   Selected as one of the best emotional performers
id0016      Female  40/31/33/34/30   Selected as a good emotional performer
id0017      Male    40/31/33/34/30   Selected as a good emotional performer
id0018      Male    40/31/33/34/30   -
id0019      Female  40/31/33/34/30   Selected as one of the best emotional performers
id0020      Male    40/31/33/34/30   Selected as one of the best emotional performers
id0021      Female  40/31/32/34/30   Selected as a good emotional performer
id0022      Female  40/31/33/34/30   -
id0023      Male    40/31/33/34/30   -
id0024      Female  40/31/33/34/30   Selected as one of the best emotional performers
id0025      Female  38/31/33/34/30   Selected as a good emotional performer
id0026      Female  40/31/33/34/30   -
id0027      Male    40/31/33/34/30   -
id0028      Male    40/31/33/34/30   -
id0029      Female  38/30/32/33/28   Speaker B
id0030      Female  39/30/32/33/29   Speaker B
Table 4.1. Database recorded by means of the AIBO scenarios.
4.3.2 AIBO commands
These recordings are the result of reading the commands without an emotional context. Speakers A and B are recorded in five different emotions in order to obtain a larger amount of data for speaker dependent classification tasks. The remaining speakers in table 4.2, on the other hand, are only recorded in the neutral emotion. All the utterances correspond to commands that AIBO is capable of recognizing and "understanding".
SPEAKER ID  SEX     AMOUNT OF DATA
A           Male    173 angry - 172 bored - 174 happy - 173 neutral - 172 sad
B           Female  98 commands uttered in each emotion
id1001      Male    69 neutral commands
id1002      Male    66 neutral commands
id1003      Female  76 neutral commands
id1004      Male    73 neutral commands
id1005      Female  71 neutral commands
id1006      Male    76 neutral commands
id1007      Female  76 neutral commands
id1008      Male    76 neutral commands
id1009      Female  67 neutral commands
id1010      Male    76 neutral commands
id1011      Male    76 neutral commands
id1012      Female  76 neutral commands
id1013      Female  76 neutral commands
Table 4.2. Database recorded from read AIBO commands.