Chapter 3
EXPRESSION OF EMOTIONS: CUES FOR EMOTION RECOGNITION.
3.1 Introduction.
The purpose of this work is the classification of emotions from certain attributes of spoken communication. It is therefore worthwhile to give a short overview of how emotions are expressed in human communication and, in particular, how they are encoded in speech.
3.1.1 Expressing emotions.
There is a large literature on the signs that indicate emotion, both within the psychological tradition and beyond it. The vocal cue is one of the fundamental expressions of emotion, on a par with facial expression. Primates, dolphins, dogs and mammals in general have emotions and can convey them by vocal cues. Humans can express their feelings by crying, laughing, shouting and also by more subtle characteristics of their speech. Consequently, in expressing and understanding emotions, different types of sources need to be considered. Emotional manifestation also has a wide range of somatic correlates, including heart rate, skin resistivity, temperature, pupillary diameter and muscle activity. These have been widely used to identify emotion-related states, for instance in lie detection. However, key studies suggest that signs of emotion are more clearly identified through facial expression [Rus94] and voice. Most works in automatic
recognition of emotions use only a single modality, either speech or video, as input data, but there are also some works that try to combine both sources to improve the recognition performance (s. [Che98, Cow01, Iwa95]). Most of the literature concerning this field is
essentially based on the suggestion, stemming from Darwin, that signs of emotion are
rooted in biology and are therefore universal [Dar65].
3.1.2 Different kinds of information within the oral communication.
In general, speech presents two broad types of information. It carries linguistic
information insofar as it identifies qualitative targets that the speaker has attained in a
configuration that conforms to the rules of the language. On the other hand, paralinguistic
information is carried by allowed variations in the way that qualitative linguistic targets
are realised. This paralinguistic signal is non-linguistic and non-verbal but tells the
interlocutor about the speaker's current affective, attitudinal or emotional state. It also
includes the speaker's regional dialect and characteristic sociolect. Paralinguistic aspects also help control turn-taking in the conversation. The paralinguistic markers, however, are to some extent culturally determined and their respective interpretation must be learned [Mar97]. This non-verbal information is primarily extracted from variations in pitch and intensity that have no linguistic function, and from voice quality, related to spectral properties that are not relevant to word identity.
Figure 3.1. Different channels in the communication of human emotions: verbal expression (linguistic information; acoustic information: prosody and voice quality), facial expression, and somatic cues (heart rate, skin resistivity, sweat, pupillary diameter, ...).
In terms of oral manifestation, emotion can be expressed through at least two verbal means: linguistically (lexical, syntactic and semantic features) and acoustically. The semantic information of a word, concerning its meaning, can be an important cue in emotion recognition, since there are words that tend to be used only in certain situations, e.g. swear words or some affect bursts (s. [Scö00, Sch94]). In the listening tests carried out by Marc Schröder [Scö00], where a database of affect bursts was presented to 20 subjects, the overall recognition rate of the uttered emotion reached 81%. Since the recordings were presented as audio only and without context, affect bursts were found to be an effective means of expressing emotions. Nevertheless, such linguistic indications alone are not enough to establish the emotional intention of the speaker because, in some languages, a swear word can also be said with a friendly meaning, and its connotation can then only be detected from non-semantic information.
As a result, the meaning of a spoken message depends not only on what is said, but also strongly on how it is said. In [Cow95] two channels have been distinguished within human
interaction: one transmits explicit messages, and is more related to semantic issues; the
other transmits implicit messages about the speakers themselves or the situation and is
associated with acoustic properties of the speech (see figure 3.1). These two channels of
human communication interact: the implicit channel tells people “how to take” what is
transmitted through the explicit channel.
Though not all studies on the subject of voice-emotion correlation agree about which attributes of the speech signal are the most favourable for detecting the speaker's emotional state, all of them agree that emotion strongly influences the acoustic characteristics of speech. Furthermore, Branka Zei, a psychologist at the University Hospital of Geneva, refers to three clearly differentiated levels of influence of emotion on speech:
 suprasegmental: overall pitch, energy, timing, intonation contour.
 segmental: articulation, timing.
 "intrasegmental": voice quality.
Although segmental features and the content of the utterance themselves carry emotion, suprasegmental features play an important role in conveying emotion [Ban96].
Murray and Arnott have conducted a literature review on human vocal emotion (table 3.1)
and concluded that, in general, the correlations between the acoustic characteristics, both prosody and voice quality, and the speaker's emotional state are consistent across different studies, with only minor differences being apparent [Mur93].
3.1.3 Vocal correlates of emotion and attitude.
A major question in work on acoustic correlates of emotion is to what extent recognition of the speaker's emotional state is due to purely acoustic cues, and to what extent it is due to surrounding factors such as the implicit message included in the utterance, the context or social convention.
Parameter     | Anger                         | Happiness                  | Sadness              | Fear              | Disgust
Speech rate   | Slightly faster               | Faster or slower           | Slightly slower      | Much faster       | Very much faster
Pitch average | Very much higher              | Much higher                | Slightly lower       | Very much higher  | Very much lower
Pitch range   | Much wider                    | Much wider                 | Slightly narrower    | Much wider        | Slightly wider
Intensity     | Higher                        | Higher                     | Lower                | Normal            | Lower
Voice quality | Breathy, chest tone           | Breathy, blaring tone      | Resonant             | Irregular voicing | Grumbled chest tone
Pitch changes | Abrupt, on stressed syllables | Smooth, upward inflections | Downward inflections | Normal            | Wide, downward terminal inflections
Articulation  | Tense                         | Normal                     | Slurring             | Precise           | Normal
Table 3.1. Emotions and speech parameters (from Murray and Arnott, 1996).
With the aim of examining how well emotions can be recognised from acoustic properties of the speech signal, the following listening experiment was carried out by L. Yang and N. Campbell in [Yan01]: 13 phonetically untrained native speakers of Chinese (5 females and 8 males) and 5 Americans (1 female and 4 males), with no knowledge of Chinese, were asked to identify 21 examples of a variety of emotional speech samples uttered in Chinese. Results show that, despite the existence of some seeming
discrepancies, there is a basic correspondence between sound and meaning. For example,
strong disgust and dismissive emphasis, as well as the degree of intensity, were clearly
differentiated by both the Chinese and American respondents, and both had a single
unambiguous interpretation. Consequently, the fact that the American respondents, having no knowledge of Chinese and being unaware of the semantic content, were still able to differentiate the speaker's state purely from the sound pattern further supports the idea of a sound-meaning link.
Another revealing study related to the recognition of emotions based only on
acoustic cues was performed by A. Tickle in [Tick00]. There, the comparison of cross-cultural emotional decoding accuracy of vocalisations encoded by native English and native Japanese speakers yielded some language-independent, and therefore message-independent, results: Japanese subjects decoded English vocalisations with greater accuracy than they decoded Japanese vocalisations. This may suggest that, despite possibly being exposed to less emotional expression, Japanese subjects are capable of decoding emotions whose acoustic correlates are due to psycho-biological response mechanisms.
It is amply demonstrated that emotional information is carried in oral communication by the acoustic aspects of the speech signal alone. A large number of studies have been carried out to determine which acoustic attributes of the oral expression are the most advantageous candidates for unequivocally detecting the speaker's emotional state. An important study is the one carried out by Murray and Arnott [Mur93], whose results, identifying the best acoustic attributes for detecting basic emotions, are presented in table 3.1. This study, in spite of being one of
the most mentioned works in the field of emotion recognition through the speech signal,
is not the only one. In a recent review of research on the acoustic correlates of speaker
states, Pittam and Scherer [Pit93] have summarised the state of the literature to date as
follows:
 Anger/Frustration: Anger generally seems to be characterised by an increase in
mean F0, F0 variability, and mean energy. Further anger effects include increases in high
frequency energy and downward directed F0 contours. The rate of articulation usually
increases.
 Anxiety/Nervousness: High arousal levels would be expected with this emotion
(see chapter 2), and this is supported by evidence showing increases in mean F0, in F0
range, and high frequency energy. Rate of articulation is reported to be higher. An
increase in mean F0 has also been found for milder forms of the emotion such as worry or
anxiety.
 Resignation/Sadness: A decrease in mean F0, F0 range, and mean energy is
usually found, as are downward directed F0 contours. There is evidence that high
frequency energy and rate of articulation decrease.
 Contentment/Happiness: Findings converge on increases in mean F0, F0 range,
F0 variability and mean energy. There is some evidence for an increase in high frequency
energy and rate of articulation.
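To make this qualitative picture concrete, the toy sketch below (Python; the emotion set, the three correlates and the scoring rule are our own illustrative choices, not part of [Pit93]) encodes the reported directions of change relative to a speaker's neutral baseline and scores how well an observed pattern matches each emotion.

    # Toy illustration only (not from [Pit93]): directions of change (+1 = increase,
    # -1 = decrease) of mean F0, F0 range and mean energy relative to a neutral baseline.
    CORRELATES = {
        "anger":     (+1, +1, +1),
        "anxiety":   (+1, +1, +1),
        "sadness":   (-1, -1, -1),
        "happiness": (+1, +1, +1),
    }

    def match_scores(observed):
        """observed: signs of deviation from the speaker's neutral baseline."""
        obs = (observed["mean_f0"], observed["f0_range"], observed["mean_energy"])
        return {emotion: sum(o * e for o, e in zip(obs, expected))
                for emotion, expected in CORRELATES.items()}

    print(match_scores({"mean_f0": -1, "f0_range": -1, "mean_energy": -1}))
    # -> sadness scores highest; anger, anxiety and happiness tie at the opposite extreme.

Note that with only these three correlates the high-arousal emotions collapse onto the same pattern; this is precisely why voice quality features are discussed below as an additional source of discrimination.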
In summary, acoustic parameters, such as F0, intensity and duration, which have
seemingly clear relationships with the most prominent perceptual characteristics of
speech (i.e. pitch, loudness and speech rate) have received the most attention. These
parameters also tend to be the easiest to analyse.
In comparison, the modulation of spectral components of the speech signal, as
revealed by the formant structure, and overall structure of the average spectrum, has been
less studied. Such aspects of the speech signal, more related to voice quality attributes, do not have such obvious perceptual correlates and are more difficult and time-consuming to analyse. However, some evidence indicates that adding voice quality features could provide resolving power when distinguishing between emotions with otherwise similar prosodic profiles.
As a conclusion, both prosodic and voice quality features must be considered as part of the emotional content of speech. The next sections, 3.2 and 3.3, introduce how each of these two acoustic profiles contributes to expressing and understanding emotions.
3.2 Prosody.
Emotion is an integral component of human speech, and prosody is the principal conveyor of the speaker's state; hence it is significant in recovering information that is
fundamental to communication. Some studies have tried to demonstrate how slightly
differentiated meanings are introduced in speech communication by prosodic variations
and how differences in contour shape communicate the degree of uncertainty or certainty with
respect to the speaker’s knowledge state, specific emotional states, the intensity of
emotion, and the effects of other co-occurring emotions (s. [Yan01, Pae00]).
3.2.1 Classification of prosody.
Prosody is a general term for those aspects of speech that span groups of syllables or
words [Par86]. These properties are not typical attributes of the individual speech sounds,
but are characteristic of longer stretches of speech. Prosody conveys information between
a speaker and a listener on several layers. In [Kie96] two main categories of prosodic
feature levels are distinguished: acoustic and linguistic prosodic features.
The acoustic prosodic features are signal-based attributes that usually span over
speech units that are larger than phonemes (syllables, words, turns, etc). Within this group
two types can be further distinguished:
 Basic prosodic features are extracted from the raw signal without any explicit segmentation into prosodic units. Fundamental frequency (F0) and frame-based energy extraction, as detailed in section 7.1, are included here. These features are not normally used directly for prosodic classification; instead, they are the basis for calculating more complex prosodic features.
 Structured prosodic features can be seen as variations of the basic prosodic attributes over time. Consequently, they are computed over a larger speech unit. Structured prosodic features can be derived from the basic prosodic features or can be based on segmental information provided, e.g., by the output of a word recognizer. The structured or compound prosodic features used during this Thesis are detailed in section 7.2.
The second main category [Kie96] includes linguistic prosodic features, which can be extracted from other knowledge sources, such as the lexicon, syntax or semantics, and usually have an intensifying or inhibitory effect on the acoustic prosodic features. These
linguistic properties are not taken into consideration in the framework of the present
Thesis. In the following, the term “prosodic features” only refers to the acoustical
prosodic features.
The principal prosodic variables of interest, as indicated in section 3.1.3, are energy, pitch and durational measurements such as speaking rate. Every prosodic feature has an acoustic correlate. However, this correspondence is not unique, since prosodic attributes refer to perceived phenomena.
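As a concrete illustration of the distinction just made, the following sketch (Python/NumPy; the frame length, hop and crude autocorrelation F0 estimator are illustrative assumptions, not the exact procedure of section 7.1) extracts the two basic prosodic features, a short-time energy contour and a rough F0 contour, from which structured features can later be computed.

    import numpy as np

    def basic_prosodic_features(x, fs, frame_ms=25.0, hop_ms=10.0, f0_min=60.0, f0_max=400.0):
        """Frame-based energy and rough F0 contours of a mono signal (illustrative sketch)."""
        frame = int(fs * frame_ms / 1000)
        hop = int(fs * hop_ms / 1000)
        energy, f0 = [], []
        for start in range(0, len(x) - frame, hop):
            w = x[start:start + frame] * np.hanning(frame)
            energy.append(float(np.sum(w ** 2)))               # short-time energy
            ac = np.correlate(w, w, mode="full")[frame - 1:]   # crude autocorrelation F0
            lo, hi = int(fs / f0_max), int(fs / f0_min)
            lag = lo + int(np.argmax(ac[lo:hi]))
            voiced = ac[lag] > 0.3 * ac[0]                     # naive voicing decision
            f0.append(fs / lag if voiced else 0.0)             # 0 marks unvoiced frames
        return np.array(energy), np.array(f0)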
3.2.2 Energy as a prosodic cue for emotion detection.
Energy is the acoustic correlate of loudness; their relation is not linear, because it strongly depends on the sensitivity of the human auditory system to different frequencies. Hence, the influence of energy on emotional perception is a combination of different surrounding factors, and valuable conclusions can also be extracted from global statistics directly derived from the energy contour.
Figure 3.2. Energy mean of the sentences uttered by (a) a male speaker and (b) a female speaker, for the emotions angry, bored, happy, neutral and sad.
In terms of global statistics, energy proves to be higher in emotions whose activation is also high. On the contrary, low energy levels are found in emotional states with a low activation value (see table 3.1). Figure 3.2 represents the mean of the per-sentence energy means over all the utterances, separated by emotion, of (a) a male speaker and (b) a female speaker. Equally, figure 3.3 represents the energy maximum under the same conditions.
Figure 3.3. Energy maximum of the sentences uttered by (a) a male speaker and (b) a female speaker, for the emotions angry, bored, happy, neutral and sad.
Both statistics are in agreement with the relations proposed in [Mur93] and summarised in table 3.1. High energy levels are found in angry and happy utterances, while sadness and boredom yield lower intensity values. Energy is one of the most intuitive indicators of the voice-emotion relation since, as said before, it is directly related to perceived loudness. Even without expertise in this matter, we can more easily imagine an angry person shouting than gently whispering.
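A minimal sketch of the kind of aggregation behind figures 3.2 and 3.3, assuming a list of (emotion label, signal) pairs for one speaker; the frame length and hop are illustrative choices, and the statistics are the per-utterance energy mean and maximum averaged per emotion.

    import numpy as np
    from collections import defaultdict

    def frame_energy(x, fs, frame_ms=25.0, hop_ms=10.0):
        """Short-time energy contour of a mono signal (illustrative sketch)."""
        frame, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
        return np.array([np.sum(x[i:i + frame] ** 2)
                         for i in range(0, len(x) - frame, hop)])

    def energy_stats_by_emotion(utterances, fs):
        """utterances: iterable of (emotion_label, samples) pairs.
        Averages the per-utterance energy mean and maximum for each emotion."""
        acc = defaultdict(lambda: {"mean": [], "max": []})
        for emotion, x in utterances:
            e = frame_energy(np.asarray(x, dtype=float), fs)
            acc[emotion]["mean"].append(float(np.mean(e)))
            acc[emotion]["max"].append(float(np.max(e)))
        return {emo: {stat: float(np.mean(vals)) for stat, vals in d.items()}
                for emo, d in acc.items()}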
3.2.3 Pitch as a prosodic cue for emotional detection.
The acoustic correlate of pitch is the fundamental frequency or F0. Voice pitch is
certainly a key parameter in the detection of emotion. Pitch variation over a sentence, also
called intonation, is used in most languages to give shape to a sentence and indicate its
structure, although the way in which this is done varies widely between languages. In
some languages pitch is also used to help indicate the meaning of words. For instance, Chinese does not use a stress system like that of English. Moreover, Chinese words lack inflection, so word formation relies on prosody; words shaped by prosodic features are the basic units of rhythmic construction in Chinese.
Pitch ranges between 80 and 160 Hz for male talkers and between 160 and 400 Hz
for females. Pitch can be characterised by means of its level or its contour. In English,
pitch is quantized into four levels (low, mid, high and extra high) and three terminal
contours are distinguished: fading (decreasing pitch and amplitude), rising (increasing
pitch, amplitude nearly constant until the end), and sustained (pitch and amplitude
approximately constant). Fundamental frequency is considered to be one of the most
important attributes in emotion expression and detection (s. [Mon02, Abe01]).
Figure 3.4. Pitch mean of the sentences uttered by (a) a male speaker and (b) a female speaker, for the emotions angry, bored, happy, neutral and sad.
Statistics extracted from the basic prosodic features contain relevant information
about the uttered emotion. From the pitch contour of one utterance we extract the mean,
maximum, minimum, variance and standard deviation among other features listed in
chapter 7. Figure 3.4 represents the mean of the pitch mean over a whole sentence for all
the utterances, separated by emotions, of (a) a male speaker and (b) a female speaker.
In accordance with the theory [Mur93], we obtain clearly differentiated statistics for the pitch mean depending on the uttered emotion (see figure 3.4). In both cases, regardless of the scale difference due to the speaker's sex, happiness and anger present a higher pitch average, while the boredom and sadness means are slightly lower with respect to the neutral state (absence of emotion).
Pitch range, calculated over the same two speakers, also depends strongly on the emotion, as can be seen in figure 3.5, in accordance with the conclusions reached by Murray and Arnott [Mur93] and summarised in table 3.1. Different statistics for other prosodic features can be found in section 7.2.
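A minimal sketch of these utterance-level pitch statistics, assuming an F0 contour in which unvoiced frames are marked with 0 (as in the extraction sketch of section 3.2.1 above); the full feature set of chapter 7 is not reproduced here.

    import numpy as np

    def pitch_statistics(f0):
        """Global statistics of the voiced part of an F0 contour (Hz)."""
        v = f0[f0 > 0]                        # keep voiced frames only
        return {
            "mean": float(np.mean(v)),
            "max": float(np.max(v)),
            "min": float(np.min(v)),
            "range": float(np.max(v) - np.min(v)),
            "var": float(np.var(v)),
            "std": float(np.std(v)),
        }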
Apart from global statistics extracted from the fundamental frequency, pitch shape characteristics such as slope, concavity and convexity are very important features in differentiating intonational meaning. For instance, the concavity or convexity of the slope is critically related to the perceived degree of severity or softness of the utterance
and these shape characteristics reflect the underlying expressive states that often arise
from the discourse process itself. Therefore, not only global statistics of the pitch, but also
attributes related to the pitch contour, should be taken into consideration. The pitch
contour can be estimated following different models (a sketch of the first option is given after the list):
 Mathematical approximation, i.e. linear regression coefficients.
 Symbolic representation, i.e. stylisation into a concatenation of straight lines according to rules (local maxima and minima).
 Intonation models, i.e. a two-component model superposing fast pitch movements on a slowly declining line (baseline).
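Of the three options above, the mathematical approximation is the simplest to sketch: fit a low-order polynomial to the voiced part of the F0 contour and keep its coefficients (slope, curvature) as features. This is only an illustrative sketch, not one of the intonation models cited below.

    import numpy as np

    def pitch_contour_coefficients(f0, order=2):
        """Least-squares polynomial fit to the voiced samples of an F0 contour.
        The linear coefficient reflects the overall slope, the leading one the
        concavity/convexity of the contour; rmse measures how well the model fits."""
        t = np.arange(len(f0), dtype=float)
        voiced = f0 > 0                      # unvoiced frames were marked with 0
        coeffs = np.polyfit(t[voiced], f0[voiced], order)
        rmse = float(np.sqrt(np.mean((np.polyval(coeffs, t[voiced]) - f0[voiced]) ** 2)))
        return coeffs, rmse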
Despite the fact that some sources claim that none of the models preserve the
information necessary for emotional speech, various studies stated the importance of
intonation as a medium for expressing emotions (s.[Wil72, Cos83, Cah90]). Accordingly,
Sylvie J.L. Mozziconacci completed a study concerning the expression of emotions when
the pitch variation is represented in the theoretical framework of an intonation model
[Moz00]. There, the power of existing intonation models for modelling expressiveness
was contemplated. She considered that studying speech variability (pitch level and range)
through such crude measures as pitch mean and standard deviation, can obscure a
substantial part of the variation present in the speech signal, and do not provide any
information concerning relevant deviations. Finally, a cluster analysis of different
emotional pitch contours was performed and it was concluded that predominantly the
final configuration of the contour plays a role in conveying emotion.
On the other hand, most of the world's languages employ a structural parameter called stress. It is characteristic of these languages that certain parts are felt to be more prominent than others, whether in isolated polysyllabic words or in larger stretches of continuous speech [Slu95]. Such prominent parts stand out from their environment due to (among other things) increased vocal effort (intensity), more accurate articulation, longer duration and pitch changes; stress tends to raise pitch, and stress-produced pitch irregularities are superimposed on the pitch contour. In spite of being introduced in this section, stress could also be considered as an acoustic correlate of pitch, energy and duration, owing to the influence of each of these attributes on stress perception. A related study by Paeschke, Kienast and Sendlmeier showed that the intensity of stressed syllables differentiated between the groups of excited and non-excited emotions.
Figure 3.5. Pitch range of the sentences uttered by (a) a male speaker and (b) a female speaker, for the emotions angry, bored, happy, neutral and sad.
3.2.4 Durational cues for emotional detection.
Prosody also involves duration-related measurements. One of the most important durational measurements for discriminating among a speaker's emotional states is the speaking rate. An acoustic correlate of the speaking rate can be defined as the inverse of the average length of the voiced regions within a certain time interval. Pauses also contribute to prosody and can be divided into two classes: unfilled and filled pauses. An unfilled pause is simply silence, or it may contain breathing or background noise, whereas a filled pause is a relatively long speech segment with rather uniform spectral characteristics. During a filled pause a hesitation form is usually uttered, indicating that one is about to start or to continue speaking, e.g. English [əː] or [əm].
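A minimal sketch of the speaking-rate definition given above, assuming a frame-level voicing decision (here simply F0 > 0, with a 10 ms hop) and treating long unvoiced stretches as unfilled pauses; detecting filled pauses would additionally require a spectral-stability criterion and is not attempted here.

    import numpy as np

    def speaking_rate_and_pauses(f0, hop_s=0.010, min_pause_s=0.25):
        """Speaking rate as the inverse of the mean voiced-region duration,
        plus the durations of unfilled pauses (long unvoiced stretches)."""
        voiced = f0 > 0
        edges = np.flatnonzero(np.diff(voiced.astype(int))) + 1   # run boundaries
        runs = np.split(voiced, edges)
        voiced_lens = [len(r) * hop_s for r in runs if r[0]]
        pauses = [len(r) * hop_s for r in runs
                  if not r[0] and len(r) * hop_s >= min_pause_s]
        rate = 1.0 / np.mean(voiced_lens) if voiced_lens else 0.0
        return rate, pauses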
In conclusion, the significance of prosodic meaning in communicating judgements, attitudes and the cognitive state of the speaker makes it essential for speech-understanding projects such as emotion and intention tracking, and for the development of natural-sounding spoken language systems.
3.3 Voice Quality
Most of the studies of the past decades related to the expression of emotions through
the voice dealt with the investigation of prosodic parameters, mainly fundamental
frequency, intensity and duration characteristics. In addition, parameters describing
the influence of laryngeal processes on voice quality have recently been taken into account (s. [Kla00, Ban96]).
3.3.1 Voice quality perception.
Voice quality is defined by Trask [Tra96] as the characteristic auditory colouring of
an individual’s voice, derived from a variety of laryngeal and supralaryngeal features and
running continuously through the individual’s speech. A wide range of phonetic variables
contributes to the subjective impression of voice quality.
The human voice possesses a wide range of variation, and the natural and distinctive
tone of speech sounds produced by a particular person yields a particular voice, which is
rarely constant in the course of a conversation. The reasons for these changes are diverse;
voice quality is influenced by the social context and situation and phonetic habits are also
a part of our personal style. It is possible to change our voice by modifying the phonetic settings, but several components of voice quality are innate and outside the speaker's control. Voice quality is usually changed to strengthen the impression of emotions. As mentioned in section 3.1, prosody undergoes changes depending on the speaker's emotional state, but voice quality is an additional valuable phonetic cue for the listener.
Definitions of voice quality or phonation types used in normal speech mainly come from studies of pathological speech (laryngeal settings), and it is hard to describe voice quality, especially its variations in a normal voice. The term voice quality subsumes attributes that concern the overall, phone-independent spectral structure, for example shimmer, formant frequency variation or some relative spectral energy measurements.
Since the notion of voice quality is unquestionably connected to speech production theory, a short overview is given in the following section 3.3.2. There, basic concepts are clarified in order to reach a better understanding of how the emotional state can influence the speaker's voice quality.
3.3.2 Speech production theory overview.
The acoustic speech signal in humans is commonly considered to result from a
combination of a source of sound energy (e.g. the larynx) modulated by a transfer (filter)
function determined by the shape of the supralaryngeal vocal tract. This combination
results in a shaped spectrum with broadband energy peaks. This model is often referred to
as the "source-filter theory of speech production" and stems from the experiments of
Johannes Müller (1848).
The source-filter theory of speech production hypothesises that an acoustic speech
signal can be seen as a source signal (the glottal source, or noise generated at a
constriction in the vocal tract), filtered with the resonances in the cavities of the vocal
tract downstream from the glottis or the constriction. Figure 3.6 shows a schema of this
process for the case of voiced speech (periodic source signal).
Figure 3.6. Source-filter model of speech production.
According to the generalised speech production theory [Mar97], the source for
voiced speech is generated by the vibration of the vocal folds. An air pressure difference
is created across the closed vocal folds by contraction of the chest and/or diaphragm
muscles. When the pressure difference becomes sufficiently large, the vocal folds are
forced apart and air begins to flow through the glottis; this is the opening phase of the
glottal cycle, as depicted in figure 3.7. When the pressure difference between the subglottal and supra-glottal passages is sufficiently reduced, airflow begins to reduce and the
glottis begins to close. This is the closing phase of the glottal cycle. Closing occurs more
rapidly than opening, apparently because the tension on the vocal folds is reinforced by
Bernoulli effects¹. The glottis quickly closes, resulting in the closed phase of the glottal cycle. These phenomena are cyclically repeated, with period T0, yielding a periodic source signal. This periodic signal is the origin of voiced sounds. On the other hand, the source of unvoiced sounds has no periodicity and is considered to be white noise.

¹ For a non-viscous, incompressible fluid in steady flow, the sum of pressure, potential and kinetic energies per unit volume is constant at any point: p + ρv²/2 + ρgy = constant, where p is the pressure, ρ the density, v the velocity and y the height in a gravitational field of strength g, all measured at the same point. This quantity is then constant throughout the fluid. In short: where the pressure is high the velocity is low, and where the pressure is low the velocity is high.
Figure 3.7. Description of the glottal waveform and its derivative. T0: duration of the pitch period; t1: beginning of the airflow; t2: instant of the maximum glottal flow, of amplitude AV, through the glottis; t3: moment of the glottal closure and maximum change of glottal flow; t4: instant of complete glottal closure.
An important parameter of the glottal flow is the Open Quotient (OQ), which is the ratio of the time during which the vocal folds are open to the whole pitch period duration ((t4-t1)/T0). Another parameter used to characterise the glottal source is the Speed Quotient (also called skewness or rk), defined as the ratio of the rise and fall times of the glottal flow ((t2-t1)/(t4-t2)). These two parameters change depending on the voice quality, as noted in the next section, 3.3.3.
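Given the glottal-cycle landmarks of figure 3.7, both quotients reduce to simple ratios. The sketch below assumes the instants t1, t2, t4 and the period T0 have already been measured, e.g. from an inverse-filtered glottal flow; the numerical example is purely illustrative.

    def glottal_quotients(t1, t2, t4, T0):
        """Open Quotient and Speed Quotient from glottal-cycle landmarks (in seconds):
        t1 = flow onset, t2 = flow maximum, t4 = complete closure, T0 = pitch period."""
        oq = (t4 - t1) / T0          # fraction of the period during which the folds are open
        sq = (t2 - t1) / (t4 - t2)   # rise time over fall time (skewness)
        return oq, sq

    # e.g. a 10 ms period (F0 = 100 Hz) with the folds open for half of it:
    print(glottal_quotients(0.000, 0.003, 0.005, 0.010))   # -> (0.5, 1.5)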
Voiced sounds consist of the fundamental frequency (F0) and its harmonic components produced by the vocal cords (vocal folds). But the source is not the only participant in the speech generation process: the vocal tract, which acts as a filter (see figure 3.8), modifies this excitation signal, giving rise to formant (pole) and sometimes anti-formant (zero) frequencies [Wit82]. The human perceptual system translates patterns of formant frequencies into specific vowels. Each formant frequency also has an amplitude and a bandwidth, and it may sometimes be difficult to determine some of these parameters correctly. The fundamental frequency and the formant frequencies are probably the most important concepts in speech processing in general.
With purely unvoiced sounds, there is no fundamental frequency in the excitation
signal and therefore no harmonic structure either, and the excitation can be considered as
white noise. The airflow is forced through a vocal tract constriction, which can occur in
several places between glottis and mouth. Some sounds are produced with complete
stoppage of airflow followed by a sudden release, producing an impulsive turbulent
excitation often followed by a more protracted turbulent excitation. Unvoiced sounds are also usually quieter and less steady than voiced ones.
Figure 3.8. Spectral manifestation of the speech production.
3.3.3 Influence of the source on voice quality.
Phonation is the contribution that the larynx makes to speech. The phonation type
strongly influences the properties of a generated sound. People may use different types of
phonation in different situations, either consciously or unconsciously (e.g., whisper when
telling a secret, breathy voice when excited, creaky voice during a hangover). A person's
tendency to use various (combinations of) types of phonation is one of the most important
things that make their voice "theirs". People who need to talk in two or more different
identifiable voices (e.g., puppeteers) will usually use different phonation types for the
different voices.
The differences in phonation are caused by a change in the excitation pulse. Thus, the
glottal waveform is different for every individual phonation type. Its specific
characteristics for each phonation type are described below in comparison to modal (normal) phonation, drawing on different approaches:
 Breathy voice: This phonation is a combination of breath and voice, which occurs if the vocal cords do not close completely along their entire length while they are vibrating; the air which flows through the remaining aperture adds whisper to the vocal cord vibrations. It is characterised by a longer fall phase and a more gradual flow cut-off; the glottal pulse is more symmetrical; a high Open Quotient; a high peak glottal flow and a lower pitch.
 Whispery voice: Air passes with turbulence through a small opening at the back of
the glottis. It presents high OQ, but lower than for breathy voice; pulses more skewed
than for breathy voice, but more symmetrical than for normal voice; a high peak glottal
flow, but lower than for breathy voice, which implies that H1 is lower in the source
spectrum.
 Creaky voice: In this kind of voicing, the vocal cords are stiffened, so that they are
very rigid as they vibrate. It presents extremely low fundamental frequency; irregular
pulses; low OQ and low glottal flow rate, resulting in weaker low frequency components,
particularly H1; impulses are relatively symmetrical, with a relatively short rise time
which dampens the low frequency components.
 Falsetto: The vocal folds are stretched tightly so that they become very thin. The
resulting vibrations can have over twice the frequency that a speaker can produce using
modal voicing. It causes very high pitch; rather low glottal peak flow; often with glottis
slightly open, thus, the effect of turbulent flow is observed in the spectrum; pulses are
quite symmetrical.
 Tense voice: sharp cutting off of the glottal flow, boosting the higher spectrum
components, very high skewness of the glottal pulse following a longer rise time; small
Open Quotient; low frequency components of spectrum attenuated in comparison to the
higher components.
 Lax voice: comparable to breathy voice; long rise time of the glottal pulse.
3.3.4 Influence of the filter on voice quality.
The vocal tract, which acts as the filter in the speech production mechanism, is a
continuous tube extending from the glottis to the lips. The length of the vocal tract, from
vocal folds to lips, is about 17 cm in adults. The shape of the vocal tract will change as
we move our tongue, jaw, palate, and lips (see figure 3.9). Depending on the position of
these structures and the overall shape of the vocal tract, certain sound frequencies will be
amplified by the tract and others will be dampened (see figure 3.8). Those frequencies
that are amplified are called the formants. In speech, the resonating cavities in the vocal
tract "shape" the glottal tone in ways consistent with the laws of acoustic resonance.
Subtle changes in tongue, jaw, palate or lips position can produce a significant change in
formant frequencies, and therefore a totally different vowel sound. The process of
modifying the shape of the vocal tract to produce different sounds is called articulation.
Figure 3.9. Different resonant cavities affect the speech source.
Even when the same vowel is uttered, slight changes in the shape of the filter (tongue, jaw, palate or lips) cause variations in the formant structure. These changes might not be significant enough to change the vowel perception, but they can provide cues when trying to find particular non-semantic aspects of the speech. Some consequences of the filter shape variation on the formant structure are listed below (a numerical illustration of the first rule follows the list):
 All formant frequencies decrease uniformly as the length of the vocal tract
increases. This rule is good common sense; larger objects resonate sound at lower
frequencies. For this reason, long organ pipes, long strings, large drums, and big
loudspeakers are used to produce low notes. However, humans cannot greatly change
their vocal tract length. Approximately a 10% increase or decrease is possible by
lowering or raising the larynx and by protruding or retracting the lips. This produces a
comparable percentage shift in the formant frequencies. The result is a darker (or
brighter) colouring of the vowels.
 All formant frequencies decrease uniformly with lip rounding and increase with
lip spreading. Lip rounding is similar to partially covering the mouth. In both cases, the
effective tube length increases (acoustically). This lowers all the resonant frequencies. It
is likely that the term "covered sound" originated as a result of musicians covering the
mouth ends of their instruments, particularly brass players. In combination with larynx
height adjustments, lip rounding or spreading can be very effective in darkening or
brightening the vowels.
 A mouth constriction lowers the first formant and raises the second formant. This
creates a more diffuse vowel spectrum. The acoustic energy is spread out over both low
and high frequencies, as in vowels [i] and [e].
 A pharyngeal constriction raises the first formant and lowers the second formant. This makes the vowel spectrum more compact, as in the case of back vowels such as [o].
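The first rule in the list above can be illustrated with the classical uniform-tube approximation, in which the resonances of a tube closed at the glottis and open at the lips fall at odd multiples of c/4L. This is a textbook idealisation used here only for illustration, not a model employed later in this Thesis.

    def uniform_tube_formants(length_m=0.17, n_formants=3, c=343.0):
        """Resonances of a uniform tube closed at one end: F_n = (2n - 1) * c / (4 * L)."""
        return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, n_formants + 1)]

    print(uniform_tube_formants(0.17))          # ~504, 1513, 2522 Hz for a 17 cm tract
    print(uniform_tube_formants(0.17 * 1.10))   # a 10% longer tract lowers every formant by ~9%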
3.3.5 Influence of emotions on voice quality.
Most studies focusing on the expression of emotions have concentrated on prosody and, though this research yields successful emotional expression in synthetic speech by manipulating pitch and duration, a lack of expressiveness is found due to the absence of voice quality considerations. Apart from the influence of the emotional state on prosodic parameters (F0, speech rate), influences on the glottal pulse shape (e.g. the open-to-closed ratio of the vocal folds), the harmonics-to-noise ratio, the spectral energy distribution and correlations with certain voicing irregularities have been proven [Kla97]. In particular, some hypotheses [Alt99] hold that parameters such as the Open Quotient and the Glottal Opening (see section 3.3.2) could reflect special properties of the vocal tract during the production of emotions, since the spectral content is closely related to voice quality parameters correlated directly with the speaker's emotional state.
With the aim of detecting properties of the physiological processes during the production of different emotional states, voice source quality parameters are analysed in [Alt99] for a subset of the acoustic signals covering three different emotions: happiness, the neutral state and anger. The data indicate that the open quotient and the glottal opening tend to reflect differences between angry and neutral statements. First observations confirm that happiness induces more breathiness in comparison with a negative and/or neutral emotional state. Therefore, they assume that happiness is produced with a higher subglottal pressure than angry or neutral statements, which should be noticeable by means of the open quotient.
On the other hand, the amplitude of the first formant depends on the degree of glottal opening during an open cycle. The intensity of this formant therefore also depends on the Open Quotient. For the calculation of both parameters, they use spectral measures. This method is also employed in this Thesis and is further detailed in section 7.3.4. Breathiness and roughness are also used as emotion-related parameters and are estimated by means of the harmonics-to-noise ratio (HNR). They show a significantly higher HNR in the sentences expressed with anger than in the neutral expressions. However, [Pin90] found that the estimation of the HNR is also influenced by other attributes such as frequency variations (jitter) or amplitude variations (shimmer).
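A simple way to approximate the harmonics-to-noise ratio of a voiced frame is from its normalised autocorrelation at the pitch period, HNR = 10·log10(r/(1-r)); the sketch below assumes the frame's F0 is already known and, as just noted, jitter and shimmer will bias such an estimate.

    import numpy as np

    def hnr_estimate(frame, fs, f0):
        """Rough harmonics-to-noise ratio (dB) of a voiced frame from the normalised
        autocorrelation at the known pitch period (illustrative sketch)."""
        w = frame * np.hanning(len(frame))
        ac = np.correlate(w, w, mode="full")[len(w) - 1:]
        lag = int(round(fs / f0))
        r = min(max(ac[lag] / ac[0], 1e-6), 0.999)   # normalised harmonic strength in (0, 1)
        return 10.0 * np.log10(r / (1.0 - r))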
In a different experimental study performed by M. Kienast and W.F. Sendlmeier
[Kie00] it was shown how various characteristics concerning vowel quality, segmental
reduction and energy distribution in voiceless fricatives were specific for each one of five
different emotions: fear, sadness, boredom, anger and happiness. Particularly, it was
revealed that vowel formant analysis yields useful tools for emotion discrimination.
Sentences expressing fear, sadness or boredom are characterised by a formant shift towards a centralised position for all the different vowels. The results obtained by analysing the formant values in angry sentences differ from the findings for sadness, fear and boredom. In angry sentences a displacement of the vowels towards a more extreme part of the vowel chart was observed in the speech samples of both male and female speakers. Overall, the vowel chart became larger and the different vowels became more distinct (see figure 3.10). In sentences with happy emotion, the analysis of the formant
frequencies revealed an increase of both the first and second formant. Following the
influence rules of section 3.3.3, a general shortening of the vocal tract could explain this
result. The shortening could be caused by lip spreading occurring during smiling as noted
by Tartter [Tar80]. A supplementary factor for this effect can also be seen in a rising of
the larynx, which is often accompanied by a high fundamental frequency. A setting of
raised larynx in happy speech is probable, because happy speech is reported to have
generally a great increase in pitch mean. The result that the first formant was more affected than the second can be explained by a greater opening degree during mandible movements in happy speech.
Figure 3.10. Average formant values for (a) sadness, (b) boredom, (c) happiness and (d) anger (grey symbols) compared with neutral (black symbols). From experiments in [Kie00].
To contrast this emotion-formant relation empirically, the following experiment was performed with one male speaker from our own database (see Chapter 4): the means of the first and second formant frequencies were calculated independently for each of the five emotions considered in the framework of this Thesis. The calculations were made over a selected voiced region considered the most likely to be an /a/ (see section 7.3), since it has been shown that spectral estimations can be calculated more accurately over open vowels. Results are presented in figure 3.11.
Figure 3.11. Variation of the first and second formant frequencies across the different emotions (angry, bored, happy, neutral and sad).
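A common way to obtain such formant estimates is linear-prediction (LPC) analysis of a pre-emphasised, windowed voiced frame: the roots of the prediction polynomial lying in the upper half-plane give candidate formant frequencies. The sketch below is a generic LPC formant estimator and not the exact procedure of section 7.3; the prediction-order rule of thumb and the thresholds are illustrative assumptions.

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lpc_formants(frame, fs, order=None, n_formants=2):
        """Estimate the lowest formant frequencies (Hz) of a voiced frame via LPC roots."""
        order = order or int(2 + fs / 1000)                       # common rule of thumb
        x = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])    # pre-emphasis
        x = x * np.hamming(len(x))
        r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
        a = solve_toeplitz(r[:-1], r[1:])                         # Yule-Walker LPC coefficients
        roots = np.roots(np.concatenate(([1.0], -a)))             # roots of the error filter A(z)
        roots = roots[np.imag(roots) > 0.01]
        freqs = sorted(np.angle(roots) * fs / (2.0 * np.pi))
        return [f for f in freqs if f > 90.0][:n_formants]        # drop near-DC candidates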
The same conclusions can be extracted from our data, as expected. The happy emotion tends to raise both formant frequencies, as happened in the study by Kienast and Sendlmeier [Kie00]. Some other studies explain this marked change in formant structure by arguing that a happy expression usually comes together with slight smiling, which causes lip movements and forces the rounded vowels to sound unrounded. A comparison of different quality features from utterances with dissimilar emotional content is presented in section 7.2.
Another study related to the expression of emotions, [Zet98], also concludes that voice quality, as well as pitch and speech rhythm, plays an important role in giving the impression of different emotions. There, a corpus of one male actor is analysed. In angry utterances he uses a rather high pitch and a high intensity. The impression is strengthened by the slow speech rate and the very distinct articulation. The speaker's voice quality is breathy, as usually happens in the expression of anxiety (see figure 3.13). This observation is based on the phonation properties introduced in section 3.3.3, where general characteristics of each voice quality type were given. Figures 3.12, 3.13 and 3.14
show the temporal signal of the same sentence uttered in three different emotional states: neutral (no emotion), anxiety and sadness; differences based on voice quality become obvious.
Figure 3.12. Neutral voice.
Figure 3.13. Anxiety – breathy voice.
Sadness has a low pitch and intensity, and there is only a small pitch range. The speech rate is rather high, with a lax articulation. He uses a creaky voice quality (s. figure 3.14), and that strengthens the impression of the emotion. In the expression of joy, the recordings show variations in intonation and a rather high pitch compared with the neutral case.
Figure 3.14. Sadness – creaky voice.
As a conclusion of this chapter, both prosodic and voice quality features are found to influence the recognition of the emotional state of the speaker. As Murray asserted in [Mur93], features related to the pitch envelope are the most important parameters in differentiating between the basic emotions², and it is the voice quality which is important in differentiating between the secondary emotions. He also noted that the acoustic correlates of basic emotions are cross-cultural, but those of the secondary emotions are culturally specific.

² Murray and Arnott [Mur96] describe the vocal effects on five basic emotions: fear, anger, sadness, happiness, and disgust.
This last idea anticipates some of the problems that arise when detecting emotions in a speaker-independent way: the difficulties encountered when creating an automatic emotion recognition system for one specific person will be greatly increased if the intention is to make this system ignore the specific properties of each speaker.
3.4 Non-speech related ways of expression and detection of
emotions.
As figure 3.1 shows, emotions are also expressed through other human
communication channels. Therefore, though information extracted from the speech signal
seems to be capable of distinguishing certain emotions, there is also an increasing interest
in the problem of recognizing emotion from facial expressions. These studies are based
on Darwin’s pioneering work [Dar65] and are possible thanks to the extensive studies on
face perception during the last two decades (s.[Ekm73, Dav75, Sch84]).
In contrast to speech, emotion recognition through face analysis can draw on a good deal of information, mainly resulting from neurobiological research. Techniques like non-invasive brain imaging have been applied in order to localise face-related regions in the human brain. Nevertheless, the automatic recognition of emotions using facial expression is not an easy task:
“Analysis of the emotional expression of a human face requires a number of preprocessing steps which attempt to detect or track the face; locate characteristic facial
regions such as eyes, mouth, and nose on it; extract and follow the movement of facial
features, such as characteristic points in these regions; or model facial gestures using
anatomic information about the face.“ [Cow01]
Most of the information about facial expression can be extracted from the position of
the eyebrows and positions of the corners of the mouth. With this aim, two main
categories of approaches derive from computational studies of facial expression
recognition:
 Target oriented: Only one single image of a face at the apex of the expression is
analysed.
 Gesture oriented: Uses facial temporal information from a sequence of images in an emotional episode.
Independently of the approach, the first step is to perform face detection or tracking. Just as there are many pitch estimation techniques in speech analysis, face tracking has been attempted using different tracking systems. The simplest methods try to differentiate regions through colour, based on the idea that the distribution of the chrominance components of a human face is located in a very small region of the colour space [Gra96]. Other systems use template information, and the results obtained improve on the recognition achieved with colour-based methods, but they are generally computationally expensive. In all cases, estimation of the pose of the face is undoubtedly necessary, since changes in expression can be due to changes in the viewing angle.
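A minimal sketch of this colour-based idea: convert the image to a luminance/chrominance representation and keep the pixels whose chrominance falls inside a small box that typically covers skin tones. The conversion coefficients and thresholds below are commonly used illustrative values, not those of [Gra96].

    import numpy as np

    def skin_mask(rgb):
        """Boolean mask of likely skin pixels for an H x W x 3 uint8 RGB image,
        using a fixed box in the Cb/Cr chrominance plane (illustrative thresholds)."""
        rgb = rgb.astype(np.float64)
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        cb = 128.0 - 0.169 * r - 0.331 * g + 0.500 * b
        cr = 128.0 + 0.500 * r - 0.419 * g - 0.081 * b
        return (cb > 77) & (cb < 127) & (cr > 133) & (cr < 173)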
Although the physiological studies related to facial expression provide a consistent base for the development of automatic recognition systems, target-based approaches do not transfer easily to machine vision in real applications. On the other hand, facial-gesture tracking approaches have produced promising results, but their psychological basis needs further exploration.
Some approaches have also tried to combine different sources to facilitate the automatic recognition of emotions. They argue that machine recognition should improve when combining the acoustic source with an optical source that contains information from the facial region such as gestures, expressions, head position, eyebrows, eyes, ears, mouth, teeth, tongue, cheeks, jaw, neck, and hair. Facial expression in combination with emotional speech cues is addressed in recent approaches. These attempts are also supported by the idea that human speech production and facial expression are inherently linked by a synchrony phenomenon, where changes often occur simultaneously in speech and facial movements. An eye blink may occur at the beginning or end of a word, while oral-cavity movements may cease at the end of a sentence.
According to an assessment by the MIT Media Lab, “research in computer
recognition and expression of emotion is in its infancy”. Two of the current research
efforts at the MIT Media Lab focus on recognition of facial expression and voice affect
synthesis. These are not, of course, the only ways to recognize affective states; posture
and physiological signs like gestures and increased breathing rate, for example, also
provide valuable cues.
Finally, somatic changes are also being used for emotion recognition. IBM (Almaden computer science research) is working on a method and system for correlating physiological attributes, including pulse, temperature, general somatic activity (GSA) and galvanic skin response (GSR), with the emotion of someone touching a computer input device such as a mouse. That means that, by simply touching the mouse, the computer would be able to determine a person's emotional state based on the somatic channel of emotional communication (see figure 3.1).