
The processing of
emotional audiovisual stimuli
By: Mariska Koekkoek
Student number: 3670864
Utrecht University
Supervisors: MSc. C. van den Boomen, Prof. dr. C. Kemner
Second reviewer: Dr. M.J.H.M. Duchateau
Date: 17-08-2012
Index
Abstract
Introduction
The effect of bimodal stimuli on the recognition of emotions: behavioural studies
Processing audio-visual stimuli: EEG studies
Processing audio-visual stimuli: imaging studies
Discussion
References
Abstract
In a natural environment humans constantly receive a multitude of different signals, and more often than not these signals engage multiple senses at the same time. These signals make it possible to understand and interact with our environment. In order to express ourselves and communicate with others we use facial expressions and different tones of voice, and correctly and swiftly identifying these signals is an important aspect of interaction. Because both auditory and visual cues are used during interaction, it is important to investigate not only the processing of these signals separately, but also the interaction between their processing. A number of studies have investigated the difference between the processing of uni- and multimodal emotional and non-emotional stimuli, and several models have been proposed to describe the interaction between multimodal stimuli. Here the results found regarding the interaction of emotional and non-emotional audio-visual stimuli at several different levels, such as behavior and brain activity, will be discussed, as well as the evidence supporting different models of this interaction. In conclusion, both emotional and non-emotional audio-visual stimuli show interaction during sensory processing. Moreover, the processing of emotional bimodal stimuli seems to depend in part on the specific emotion.
Introduction
In our everyday life we constantly receive various kinds of signals from our environment. These signals make it possible to interact with our surroundings and with other beings living in this environment. Signals are often perceived by multiple senses at the same time, e.g. sight and hearing, and all of these signals are then used to compose a picture of our surroundings.
The processing of these signals has been studied extensively, although many of these studies focused on a single modality instead of multiple modalities. In a natural environment, however, more than one sense often receives information at the same time. Investigation of the combination of multiple modalities is therefore important, and it has led to the discovery of interesting phenomena. For example, hearing and seeing someone speak at the same time may lead to the perception of a sound that is formed by neither the auditory nor the visual aspect alone (McGurk and MacDonald, 1976). Additional evidence that the integration of sound and image can have an illusory effect came from Shams et al. (2001), who showed that when one flash of light is accompanied by two bursts of noise, the single flash is perceived as two separate flashes. Furthermore, visual cues may also draw attention towards a possible location of an auditory cue, but only under the condition that the auditory cue is difficult to locate (Spence and Driver, 2000; Vroomen et al., 2001a). Aside from illusory effects, seeing as well as hearing speech can also help in understanding what is spoken (Grant, 2001; Grant and Seitz, 2000; Schwartz et al., 2004). All these examples illustrate that there seems to be an interaction in the processing of auditory and visual stimuli.
However, these studies primarily used very general and biologically irrelevant stimuli. As humans we interact daily with other people. An important aspect of these interactions is the emotion expressed by others through, for example, the face or the voice. The ability to accurately and swiftly identify these emotions in others is therefore very important. Early research into the recognition of emotional facial expressions found that basic expressions are recognized as portraying a particular emotion across cultures. Ekman and Friesen (1969) conducted an experiment to investigate whether emotional faces are recognized across cultures regardless of prior experience or lack of interaction between these cultures. Results showed that facial expressions are recognized as portraying a particular emotion and could therefore be considered universal. However, they acknowledged the possibility that subjects were able to recognize these expressions because they had all been extensively exposed to the same mass-media photos of the facial expressions, which may have led them to recognize the expressions. In a follow-up study to investigate this possibility, Ekman and Friesen (1971) showed that members of a culture minimally exposed to mass media, as well as literate cultures, are able to recognize basic emotions such as fear, anger, sadness, disgust, happiness and surprise expressed by other cultures. These results indicate that it is very likely that the recognition of emotions, an important aspect of human interaction, is universal.
Apart from facial expressions, emotions can be expressed by the voice as well. It has been found that some emotions are easier to recognize using cues from facial expressions, whereas others are easier to recognize from the tone of voice (de Silva et al., 1997). For example, happy emotions are quite easy to recognize from facial expressions (de Gelder et al., 1998), but quite difficult to recognize from the tone of voice (Vroomen et al., 1993). When investigating whether vocally expressed emotions, like facial expressions, are recognized across cultures, similar results have been found. Participants from different cultures were able to discriminate emotions from sentences spoken in foreign languages well above chance level (Pell et al., 2009; Scherer et al., 2001; Thompson and Balkwill, 2006). However, some emotions such as anger and sadness were easier to recognize in foreign languages than other emotions such as fear and joy (Pell et al., 2009; Thompson and Balkwill, 2006), indicating again that some emotions are easier to recognize than others. These results show that not only are emotional facial expressions universal, but emotions expressed by the voice are also recognized across cultures. In short, emotions are easily recognized across cultures regardless of whether they are expressed through faces or voices. The integration of audio-visual stimuli, however, seems to lead to different effects. It is therefore interesting to know the effect of emotional audio-visual integration.
In this paper the effect of emotional audio-visual interaction on behavior and brain activity will be investigated. To this aim, emotional audio-visual stimuli will be compared to unimodal emotional stimuli as well as to non-emotional audio-visual stimuli. Behavioral, EEG and imaging studies will be reviewed in order to get a full understanding of the effect of bimodal stimuli on behavior and brain activity compared to unimodal stimuli. In order to determine whether integration of the processing of audio-visual stimuli occurs, as seems to be suggested by the phenomena that occur when both auditory and visual stimuli are presented, two different models will be reviewed, which will be explained below. Furthermore, the results of the different studies will be analyzed to determine the nature of this interaction.
The effect of bimodal stimuli on the recognition of emotions: behavioural studies
It is a well-known phenomenon that the simultaneous presentation of auditory and visual stimuli leads to shorter reaction times compared to visual and auditory stimuli presented separately. Additionally, it also leads to a higher accuracy in the identification of said stimuli (Giard and Peronnet, 1999; Molholm et al., 2002, 2004; Teder-Sälejärvi et al., 2002; Teder-Sälejärvi et al., 2005). Reaction to bimodal (redundant) targets is thus often swifter than reaction to unimodal targets. To explain this effect multiple models have been suggested. One of these models is the race-model. This model suggests that the input from both sources is processed in parallel, that the first stimulus to finish processing determines the reaction time (RT), and that no interaction occurs. The probability of obtaining an RT faster than a given time is then higher for bimodal targets than for unimodal targets (Gondan et al., 2005). However, there is a limit to the redundancy gain that can be explained by this effect, as expressed by Miller's inequality model (see Miller, 1982). Violation of this inequality model, i.e. a redundancy gain higher than the inequality model allows, leads to the rejection of separate activation models such as the race-model. In general, studies investigating the redundant-target effect examine whether the RT gain for bimodal compared to unimodal stimuli exceeds the inequality model in order to determine whether the race-model is adequate to explain the difference in RT between bi- and unimodal stimuli. If it is found inadequate, the race-model is rejected in favor of another model such as a co-activation model, which suggests that information from both sources is integrated, which would explain the excessive redundancy gain (Gondan et al., 2005). Adopting either the race-model or a co-activation model leads to a different assumption about the nature of the processing of audio-visual stimuli: the race-model suggests that there is no neural interaction in the processing of the stimuli, while the co-activation model suggests there is. However, failure to violate the race-model does not by itself prove that no interaction occurs.
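To make this test concrete, the sketch below compares the empirical RT distribution of the bimodal condition with Miller's bound, the summed unimodal distributions; the reaction times, sample sizes and probe points are hypothetical placeholders rather than data from any of the studies cited here.

```python
import numpy as np

def ecdf(rts, t):
    """Empirical cumulative distribution of reaction times: P(RT <= t)."""
    rts = np.asarray(rts)
    return np.mean(rts[:, None] <= np.asarray(t), axis=0)

def race_model_check(rt_av, rt_a, rt_v, probes=None):
    """Test Miller's (1982) race-model inequality:
    P(RT <= t | AV) <= P(RT <= t | A) + P(RT <= t | V) for every t.
    A positive difference at any probe point is a violation and argues
    for co-activation rather than a race between the two modalities."""
    if probes is None:
        probes = np.percentile(np.concatenate([rt_av, rt_a, rt_v]),
                               np.arange(5, 100, 5))
    bound = np.minimum(ecdf(rt_a, probes) + ecdf(rt_v, probes), 1.0)
    violation = ecdf(rt_av, probes) - bound
    return probes, violation

# Hypothetical reaction times in ms; real studies use per-subject data.
rng = np.random.default_rng(0)
rt_a = rng.normal(420, 60, 200)    # auditory-only condition
rt_v = rng.normal(400, 60, 200)    # visual-only condition
rt_av = rng.normal(340, 50, 200)   # markedly faster bimodal condition

probes, violation = race_model_check(rt_av, rt_a, rt_v)
print("maximum violation:", violation.max())  # > 0 suggests co-activation
```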
A number of studies have used the inequality model to investigate the redundancy gain of both emotional and non-emotional audio-visual stimuli (Collignon et al., 2008, 2010; Corballis, 1998, 2002; de Gelder and Vroomen, 2000; Massaro and Egan, 1993; Marzi et al., 1996; Molholm et al., 2002, 2006; Murray et al., 2001; Teder-Sälejärvi et al., 2005). Although some studies found the race-model to be sufficient to explain the redundancy gain (Corballis, 1998, 2002; Murray et al., 2001), a large number of studies found a violation of the inequality model when using both non-emotional and emotional stimuli (de Gelder and Vroomen, 2000; Gielen et al., 1983; Gondan et al., 2005; Massaro and Egan, 1993; Miller, 1982; Schröger and Widmann, 1998). The redundancy gain of multimodal stimuli exceeds the probability summation of the race-model, suggesting that there is an interaction in the processing of both stimuli. This leads to the rejection of this model in favor of a co-activation model. However, it has been argued that the effect of combined modalities might in some cases be due to the instructions given, which lead subjects to pay closer attention to both modalities. In order to rule out this explanation, de Gelder and Vroomen (2000) used a test in which subjects were instructed to pay attention to only one modality and to ignore the input of the other. Should the effect still be present under these conditions, it would suggest that one modality is affected by input from the other, regardless of attentional bias. The results showed that this was indeed the case. The effect found when subjects were instructed to pay attention to only one of the modalities was smaller, but still present. This indicates that the influence of input from a modality to which one pays no attention is involuntary. This cross-modal interaction is even present when participants' attention is further diverted to a completely unrelated task (Vroomen et al., 2001b). It was therefore suggested that cross-modal interaction is both present and automatic.
Besides the effect bimodal stimuli have on the processing of either modality, other factors may influence RT as well when emotional bimodal stimuli are used, such as congruency. Congruent emotion pairs are recognized more swiftly than incongruent emotional pairs (de Gelder and Vroomen, 2000; Massaro and Egan, 1993), comparable to the effect found for congruent and incongruent spatial distributions of visual and auditory stimuli, in which congruently distributed multimodal stimuli lead to a shorter reaction time than incongruently distributed stimuli (Miller, 1991). Massaro and Egan (1993) and de Gelder and Vroomen (2000) thus found similar results with regard to reaction time and the congruency effect.
In summary, results from multiple behavioral studies show that there is an interaction in the processing of auditory and visual stimuli, which results in a swifter reaction time and a more accurate recognition of stimuli. The same effect has been found in the response to non-emotional and emotional audio-visual stimuli. Furthermore, a congruency effect has been found, in which congruent emotion pairs are recognized more easily than incongruent pairs.
Processing audio-visual stimuli: EEG studies
Behavioral studies show that there appears to be an interaction in the processing of auditory and visual stimuli when they are presented simultaneously. Nevertheless, this does not tell us much about the effects of these stimuli on activity in different brain areas or how these areas affect one another, only what the result of integrating the stimuli is in terms of behavior. In order to investigate the actual processing of these integrated stimuli within the brain, one can for example measure brain activity using electroencephalography (EEG) to record early event-related potentials (ERPs). The advantage of EEG is its high temporal resolution: ERPs can detect changes over milliseconds, as opposed to other methods for measuring brain activity such as fMRI and PET, which have a much coarser temporal resolution. This makes ERPs particularly suited to investigate early activity and small differences in the latency of ERPs evoked by different types of stimuli. Several studies have investigated brain activity related to multimodal interaction using this method with both non-emotional and emotional stimuli. For example, ERPs evoked by non-emotional audio-visual stimuli often display a difference in the amplitude of modality-specific peaks compared to those evoked by auditory or visual stimuli alone, even for very early potentials. Furthermore, the processing of audio-visual stimuli evokes components which are not found when using unimodal auditory or visual stimuli (Giard and Peronnet, 1999; Molholm et al., 2002, 2004). For example, several studies have shown an activation over parieto-occipital channels 40-90 ms after stimulus onset in the audio-visual condition, with no similar activation in the unimodal conditions (Giard and Peronnet, 1999; Fort et al., 2002). The same studies showed that a component primarily evoked by auditory stimuli, the auditory N1, was enhanced in the multimodal condition compared to the unimodal auditory condition. Similar results were reported by Molholm et al. (2002), who also found an early activation in parieto-occipital regions which was enhanced in the bimodal condition. Additionally, primarily visually evoked components (visual P1 and N1) were attenuated in the audio-visual condition compared to the unimodal conditions, whereas primarily auditory components (auditory N1 and P2) were enhanced in the audio-visual condition. Moreover, the scalp distribution of the visual N1 was modulated in the bimodal condition, with the N1 peaking more posteriorly compared to the visual condition (Molholm et al., 2004). In summary, these studies indicate that non-emotional audio-visual stimuli show an early interaction during sensory processing, as indicated by the unique early ERP components in audio-visual conditions and the modulation of modality-specific components.
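As an illustration of how such component comparisons are typically quantified, the sketch below averages the ERP within a fixed latency window for each condition and contrasts the bimodal value with the unimodal ones; the epochs, sampling rate, baseline and the 80-120 ms window are illustrative assumptions rather than parameters taken from the studies above.

```python
import numpy as np

def mean_amplitude(epochs, sfreq, tmin, tmax, baseline=0.2):
    """Mean voltage of the trial-averaged ERP in a latency window.

    epochs   : array (n_trials, n_samples) for one channel, in microvolts
    sfreq    : sampling rate in Hz
    tmin/max : window relative to stimulus onset, in seconds
    baseline : pre-stimulus interval contained at the start of each epoch
    """
    erp = epochs.mean(axis=0)                        # average across trials
    start = int(round((baseline + tmin) * sfreq))
    stop = int(round((baseline + tmax) * sfreq))
    return erp[start:stop].mean()

# Hypothetical single-channel epochs for the three conditions.
rng = np.random.default_rng(1)
sfreq, n_samples = 500, 500                          # 1 s epochs at 500 Hz
conditions = {name: rng.normal(0, 5, (80, n_samples)) for name in ("A", "V", "AV")}

# Amplitude in an assumed auditory N1 window (80-120 ms after onset).
n1 = {name: mean_amplitude(ep, sfreq, 0.08, 0.12) for name, ep in conditions.items()}
print("N1-window amplitude per condition:", n1)
print("AV minus A:", n1["AV"] - n1["A"])             # more negative = enhanced N1
```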
A number of ERP studies have also investigated the processing of emotional stimuli (de Gelder et al., 1999; Magnée et al., 2008; Pourtois et al., 2000, 2002), again showing a modulation of the amplitude of the ERPs by combined stimuli, as well as a modulation of the latency of the evoked peaks. Several different components have been examined in order to determine whether components evoked by one modality are influenced by information received from a second modality, for example the auditory N1 and P2b (Pourtois et al., 2000, 2002) and the visual N2 (Magnée et al., 2008). The amplitude of the auditory N1 was enhanced in the emotional audio-visual condition compared to unimodal conditions. Moreover, this result was not replicated when using inverted faces, indicating that the processing of facial expressions is required for the enhancement of the N1 component in audio-visual conditions. In addition, when using emotional stimuli, components may also be influenced by the congruence of the emotions: congruent emotional face-voice pairs may affect components differently than incongruent pairs. It was found that congruent face-voice pairs elicited an earlier P2b component compared to incongruent face-voice pairs. In accordance with the auditory components, the visual N2 shows similar results. This component is affected by emotional sound and, furthermore, is modulated differently by different emotional voices, for example displaying a larger amplitude for fearful voices compared to happy voices. Moreover, this peak displays a larger amplitude for congruent fearful face-voice pairs, whereas neither congruent nor incongruent face-voice pairs led to an enhanced amplitude in the happy voice condition. In contrast, several studies have found a slightly different effect with regard to congruency. For example, several studies revealed that when comparing fearful and happy audio-visual stimuli, incongruent stimuli evoke higher activity than congruent stimuli (Magnée et al., 2011; Paulmann and Pell, 2010; Grossman et al., 2006). Additionally, it was found that this congruency effect depended on the length of the auditory stimuli, with shorter auditory fragments evoking larger activity in congruent conditions and longer fragments evoking larger activity in incongruent conditions (Paulmann and Pell, 2010).
In addition to these auditory and visual components, de Gelder et al. (1999) examined the mismatch negativity (MMN). The MMN is often used in auditory studies as a way to examine the processing of auditory stimuli: when a series of similar auditory stimuli is interrupted by a deviant auditory stimulus, this deviant elicits the MMN. The amplitude of this component increases with the difference between the standard and the deviant stimulus. The MMN was long thought to be sensitive to only one modality, but more recent studies have shown that, aside from the processing of auditory stimuli, it can also reflect the processing of cross-modal stimuli. De Gelder et al. (1999) showed that a series of emotionally congruent audio-visual stimuli (an angry voice combined with an angry face) followed by an incongruent pair (an angry voice combined with a sad face) elicits such an MMN as well. The results were replicated when a series of incongruent face-voice pairs was followed by a congruent pair.
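A common way of quantifying the (v)MMN described here is a deviant-minus-standard difference wave. The sketch below shows that computation on placeholder epoch arrays, together with the most negative value in a typical MMN latency window; the window, sampling rate and trial counts are assumptions for illustration only.

```python
import numpy as np

def mismatch_negativity(standard_epochs, deviant_epochs, sfreq,
                        tmin=0.10, tmax=0.25, baseline=0.1):
    """Deviant-minus-standard difference wave and its most negative value
    inside a typical MMN latency window (times in seconds after onset)."""
    diff = deviant_epochs.mean(axis=0) - standard_epochs.mean(axis=0)
    start = int(round((baseline + tmin) * sfreq))
    stop = int(round((baseline + tmax) * sfreq))
    return diff, diff[start:stop].min()

# Placeholder data: 200 standard trials and 40 deviant (incongruent) trials.
rng = np.random.default_rng(2)
sfreq, n_samples = 250, 150                  # 0.6 s epochs at 250 Hz
standards = rng.normal(0, 4, (200, n_samples))
deviants = rng.normal(0, 4, (40, n_samples))

difference_wave, mmn = mismatch_negativity(standards, deviants, sfreq)
print("MMN amplitude (µV):", mmn)            # more negative for larger deviance
```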
As mentioned, the MMN is thought to primarily reflect auditory processing. In recent years an analogous component evoked by visual stimuli has been described, the vMMN (Pazo-Alvarez et al., 2003). This component shows enhanced activation to deviant visual stimuli interrupting a series of similar visual stimuli. The vMMN has been used to investigate the effect of different facial expressions. Zhao and Li (2006) investigated the mismatch negativity evoked by facial expressions while participants focused on another task. Participants were asked to determine the pitch of a series of tones while they were simultaneously shown a series of neutral faces, which was interrupted by either sad or happy faces. The vMMN evoked by sad faces was more negative than that evoked by happy faces. It was concluded that facial expressions can elicit a vMMN and that different expressions evoke different vMMN amplitudes.
In accordance with the results found with non-emotional audio-visual stimuli, these studies show a modulation of early ERP components, both auditory and visual. Furthermore, the MMN findings show that the processing of emotional auditory stimuli is affected by the processing of simultaneously presented emotional visual stimuli, and vice versa. Further analysis of the results shows that the modulation of early ERP components is quite similar in emotional and non-emotional conditions, for example an enhanced amplitude of auditory components in audio-visual conditions. However, the investigation of emotional audio-visual integration also shows that the modulation of the components may depend on the type of emotion presented by the stimuli, with some components showing a larger amplitude for particular emotions.
Although these studies give us a good understanding of the interaction of audio-visual stimuli on a fine temporal scale, EEG has a low spatial resolution and is not well suited to localize the activity evoked by stimuli across large areas. Other methods such as tract tracing and neuroimaging may clarify the interaction between specific auditory and visual regions in the brain. As opposed to EEG, imaging techniques have a low temporal resolution but a high spatial resolution. These techniques are therefore useful to investigate the location of activation in response to stimuli.
Processing audio-visual stimuli: imaging studies
It is generally accepted that the various kinds of sensory input are processed in sensory-specific cortices such as the auditory, tactile and visual cortex. These cortices are hierarchical, with connectivity between lower and higher levels. Furthermore, traditional models indicated that the specific sensory systems are linked by association cortices which are activated during the integration of multisensory input. This leads to a bottom-up model which states that different kinds of sensory information converge in specific higher-order multimodal areas. The bottom-up model assumes that sensory-specific stimuli are first processed separately in modality-specific brain regions and are subsequently integrated in higher-order multimodal brain areas through feed-forward projections. More recently, however, evidence has been found supporting the top-down model, which states that these multimodal areas project back to sensory-specific areas through feedback connections. Both models have been supported by anatomical and imaging studies using both emotional and non-emotional stimuli. However, as seen in the EEG studies, the processing of audio-visual stimuli may differ for emotional and non-emotional stimuli. In order to get a full understanding of the processing of emotional audio-visual stimuli it is not only important to compare it to unimodal stimuli, but to non-emotional stimuli as well. The next part will therefore first review studies investigating non-emotional audio-visual stimuli.
A number of studies have found that brain regions which are thought to be unimodal, and should therefore be activated by a single modality, are actually activated by other modalities as well (Calvert et al., 2002; Finney et al., 2001; Sadato et al., 1996). For example, Calvert et al. (2002) showed that lipreading activates the auditory cortex even in the absence of sound. However, this was only the case if linguistic facial movements were shown; if non-linguistic facial movements were viewed, no activation in the auditory cortex was found. Further evidence of activation of the auditory cortex by visual stimuli comes from Finney et al. (2001), who examined the activation of the auditory cortex by visual stimuli in deaf subjects. It was found that a non-linguistic visual moving object elicited activation in the auditory cortex. Both of these studies indicate that the auditory cortex can be activated by both language-related and non-language-related visual stimuli. In addition, other sensory-specific areas have been found to show activation in response to information from other senses. Sadato et al. (1996) showed that the visual cortex can be activated by tactile input during Braille reading in blind people, in both linguistic and non-linguistic tasks. Furthermore, the loss of one sensory modality may also lead to a rearrangement of the connectivity between modalities. For example, Weeks et al. (2000) revealed that the loss of sight leads to a change in connectivity in the right posterior parietal cortex, which participates in the localization of sound in congenitally blind individuals. In sum, these studies show that primarily unimodal cortices may be activated by sensory input from other senses.
Brain imaging and tracing have also been used extensively to determine which brain regions show activation when a person is presented with bimodal stimuli, using both emotional and non-emotional bimodal stimuli. Furthermore, several studies have investigated the connectivity between auditory and visual regions in several different species presented with non-emotional stimuli, using both imaging and tracing techniques. These studies revealed connections between a great number of regions (Table 1). For example, it was revealed that primary auditory sites project to both auditory and visual brain areas (Eckert et al., 2008). This result was further supported by a tracing study by Rockland and Ojima (2003). Similar results were reported by Falchier et al. (2002) and Cappe and Barone (2005), who both showed connections between visual and auditory regions. In summary, these studies revealed connections both within and between sensory-specific brain regions.
Further investigation into the brain activity evoked by non-emotional audio-visual stimuli compared to unimodal stimuli revealed that non-emotional bimodal stimuli evoke enhanced activity in several different areas. For example, Calvert et al. (2000) found enhanced activation in the primary and secondary auditory cortex as well as in visual motion cortex. Although there were some slight differences in the specific activation sites, these results were replicated by Scheef et al. (2009), strongly indicating that these brain areas are involved in the processing of non-emotional bimodal stimuli (Table 2).
Moreover, to determine the processing of bimodal emotional stimuli, several studies have investigated brain activity when participants are presented with emotional audio-visual stimuli. A number of studies have found that emotional bimodal stimuli evoke enhanced activity in the auditory cortex compared to neutral bimodal stimuli (Ethofer et al., 2006; Robins et al., 2009). Further analysis, however, showed that different emotions activated specific regions (Table 3). In addition, a number of studies have compared bimodal emotional stimuli to unimodal stimuli. These studies have found, among other things, activation in auditory and multimodal brain areas (Table 4). Further analysis again showed a differential enhancement for different emotions.
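The imaging comparisons in the paragraphs above rest on contrasts of condition-wise response estimates. As a minimal illustration, and not the analysis pipeline of any study cited here, the sketch below applies two criteria commonly used to flag multisensory enhancement in a region of interest, the "max" criterion and superadditivity, to hypothetical per-subject response estimates (betas); the region, values and sample size are assumptions.

```python
import numpy as np

def bimodal_enhancement(beta_av, beta_a, beta_v):
    """Two criteria often used to flag multisensory enhancement in a region
    of interest, given condition-wise response estimates (betas):
      'max'   : the AV response exceeds the stronger unimodal response
      'super' : the AV response exceeds the sum of the unimodal responses
    """
    return {
        "max": beta_av > np.maximum(beta_a, beta_v),
        "super": beta_av > beta_a + beta_v,
    }

# Hypothetical per-subject betas for one region (e.g. superior temporal sulcus).
rng = np.random.default_rng(3)
beta_a = rng.normal(1.0, 0.3, 20)
beta_v = rng.normal(0.8, 0.3, 20)
beta_av = rng.normal(2.1, 0.4, 20)

flags = bimodal_enhancement(beta_av, beta_a, beta_v)
print("subjects meeting the max criterion:", int(flags["max"].sum()), "of 20")
print("subjects meeting superadditivity:", int(flags["super"].sum()), "of 20")
```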
Table 1: Anatomical regions showing connectivity within and between sensory-specific modalities (inter-/intramodal) across studies (Eckert et al., 2008; Rockland and Ojima, 2003; Falchier et al., 2002; Cappe and Barone, 2005). Projection sites reported across these studies include A1, MT+, left and right V1, the parietal lobule, the caudolateral and middle lateral auditory belt, calcarine and peripheral V1 sites, and visual cortex. Receiving sites include the left and right superior temporal gyrus, the medial geniculate nucleus, the anterior calcarine cluster and the left and right anterior calcarine fissure, posterior thalamic nuclei, the lateral occipito-temporal cortex, Heschl's gyrus, MT+, the lateral geniculate nucleus, the calcarine fissure, dorsal V2/V1, V2, A1, STP and auditory cortex. Both intramodal and intermodal connections were reported.
Table 2: Areas of enhanced activation in the non-emotional audio-visual condition compared to unimodal conditions across studies.
Calvert et al., 2000 (congruent bimodal vs. unimodal): right fusiform gyrus, right middle occipital gyrus, left middle frontal gyrus, right inferior parietal lobule, right occipital-temporal gyrus, superior temporal sulcus, left Heschl's gyrus, left cuneus.
Scheef et al., 2009 (bimodal vs. unimodal): middle and inferior occipital lobe, middle and superior temporal lobe, medial and superior frontal lobe, right posterior lobe of the cerebellum, fusiform gyrus, left cuneus, cingulate gyrus.
Table 3: Areas of enhanced activity in the emotional audio-visual condition compared to the neutral audio-visual condition across studies.
Robins et al., 2009 (angry, fearful and happy, analysed combined and per emotion): right anterior superior temporal gyrus, left fusiform gyrus, left superior temporal gyrus, lateral fissure, left superior frontal gyrus, right superior temporal gyrus, right fusiform gyrus, left anterior lateral fissure, right superior frontal gyrus, posterior thalamus.
Kreifelts et al., 2007 (alluring, angry, disgusted, fearful, happy and sad): posterior superior temporal gyrus, right thalamus.
Table 4: Areas of enhanced activation in emotional bimodal conditions compared to unimodal conditions across studies.
Robins et al., 2009 (angry, fearful and happy; fearful): posterior superior temporal sulcus, left superior temporal sulcus.
Pourtois et al., 2005 (fearful and happy; fearful; happy): right posterior superior temporal sulcus, right temporal pole, left anterior superior temporal sulcus, left middle temporal gyrus, fusiform gyrus, posterior middle temporal gyrus, superior frontal gyrus, middle frontal gyrus, inferior parietal lobule.
Kreifelts et al., 2007 (alluring, angry, disgusted, fearful, happy, sad and neutral, in several combinations): left posterior superior temporal gyrus, right posterior superior temporal gyrus, right thalamus.
In addition, the effect of congruent compared to incongruent non-emotional and emotional audio-visual stimuli on brain activity has also been studied (Dolan et al., 2001; Jones and Callan, 2003). Jones and Callan (2003) found higher activation in the right supramarginal gyrus and the left inferior parietal lobule during the presentation of non-emotional incongruent stimuli compared to congruent stimuli. In addition, incongruent stimuli resulted in more activation in the right precentral gyrus. Dolan et al. (2001) studied the congruency effect of emotional bimodal stimuli. Enhanced activity was found in, for example, the left amygdala and the right fusiform cortex for congruent fearful conditions. Congruent happy conditions, however, elicited enhanced activity in different brain regions, demonstrating again a differentiation in the activity evoked by different emotions.
Furthermore, in order to understand the integration of emotional bimodal stimuli, Kreifelts et al. (2007) investigated the connectivity between specific brain regions. Increased connectivity was found between the right and left posterior superior temporal sulcus, as well as between the right thalamus and several other structures such as the fusiform gyrus. Combined with the results of the previously mentioned studies, this indicates that the superior temporal sulcus and gyrus may play a major role in the processing of bimodal emotional stimuli. Furthermore, the results also show that, besides these areas, each emotion elicits enhancement in specific brain regions.
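Kreifelts et al. (2007) quantified connectivity between regions of interest; as a generic sketch of how such an analysis can be set up, and not their specific method, the code below correlates the BOLD time courses of two regions and contrasts a bimodal with a unimodal run. The region names, time courses and run lengths are placeholders.

```python
import numpy as np

def roi_connectivity(ts_a, ts_b):
    """Functional connectivity as the Pearson correlation between the
    BOLD time courses of two regions of interest."""
    return np.corrcoef(ts_a, ts_b)[0, 1]

# Placeholder time courses (e.g. left and right posterior STS) for a
# bimodal run and a unimodal run of 200 volumes each.
rng = np.random.default_rng(4)
shared = rng.normal(0, 1, 200)                       # common signal in the AV run
left_psts_av = shared + rng.normal(0, 0.5, 200)
right_psts_av = shared + rng.normal(0, 0.5, 200)
left_psts_a = rng.normal(0, 1, 200)
right_psts_a = rng.normal(0, 1, 200)

print("AV run connectivity:", roi_connectivity(left_psts_av, right_psts_av))
print("A-only run connectivity:", roi_connectivity(left_psts_a, right_psts_a))
# A reliably higher correlation in the bimodal run would be read as increased
# coupling between the two regions during audio-visual stimulation.
```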
In summary, like the EEG studies, imaging studies show that emotional audio-visual stimuli modulate brain activity compared to unimodal stimuli. Compared to non-emotional bimodal stimuli, emotional stimuli may have a stronger or a comparable effect on brain activity, depending on the type of emotion.
Discussion
This paper reviewed the literature on the effect of emotional audio-visual stimuli compared to unimodal emotional stimuli and bimodal neutral stimuli. A great number of studies have found differences in behavior and brain activity in response to both emotional and non-emotional bimodal compared to unimodal stimulation. The comparison of emotional bimodal and unimodal stimuli shows differences in both behavior and brain activity. And although there are some differences in the processing of emotional bimodal stimuli compared to neutral bimodal stimuli, e.g. different emotions showing enhanced activity in different brain areas, there are also several similarities. For example, both evoke enhanced activity in sensory-specific brain areas in multimodal compared to unimodal conditions. In addition, both emotional and non-emotional multimodal stimuli evoke activity in multimodal regions.
One major research question in behavioral studies comparing unimodal and bimodal stimuli is the difference in RT. In general, multimodal stimulation leads to a reduction in RT compared to unimodal stimulation. Two types of models are often used to explain this result: the race-model and co-activation models. As said before, as long as the RT facilitation does not exceed the probability summation and violate the inequality model, the race-model is considered sufficient to explain it. In this case there is no reason to assume integration of bimodal sensory information. In contrast, if the RT facilitation does violate the inequality model, the race-model is rejected and it is assumed that integration of bimodal input occurs, as is the case in co-activation models. For example, the race-model has been found sufficient to explain the difference in RT in several studies examining unimodal visual stimuli presented in the left or right visual field or in both visual fields (Corballis, 1998, 2002; Murray et al., 2001). Corballis (1998, 2002) found a faster response to presentation in both visual fields; however, this facilitation did not exceed the redundancy gain predicted by probability summation, and the race-model was not rejected. In contrast, other studies have found a violation of the race-model. These studies concluded that the race-model was insufficient to explain the reduction in reaction time and that this reduction was more likely a result of co-activation of different brain regions rather than simply a result of the fastest processing of one of the presented stimuli (Collignon et al., 2008, 2010; Marzi et al., 1996; Molholm et al., 2002, 2006; Teder-Sälejärvi et al., 2005). In short, a number of studies find a violation of the inequality model, which indicates an interaction in the processing of auditory and visual stimuli. And although some studies have also found the race-model to be sufficient to explain the RT facilitation, this in itself does not exclude the possibility of interaction between unimodal brain areas, nor does it exclude interaction in the processing of audio-visual stimuli. In short, behavioral studies provide no evidence to assume that there is no interaction in the processing of audio-visual stimuli, and quite some results that suggest that there is.
Moreover, the interaction of audio-visual stimuli is not only supported by behavioral studies, but also by EEG and imaging studies. This is shown by the modulation of ERPs by emotional bimodal stimuli and by the enhanced brain activity revealed by imaging studies in comparison to both emotional unimodal and non-emotional bimodal stimulation. Again, when comparing emotional with non-emotional stimuli there is very little difference overall; both result in a modulation of modality-specific ERP components and an enhancement of activity in modality-specific and multimodal brain areas. Unlike behavioral studies, however, both EEG and imaging studies also show that the processing of emotional audio-visual stimuli is slightly more complex than that of non-emotional audio-visual stimuli. Different emotions tend to have a different effect on the modulation of ERP components: some have little to no effect, while others have a very strong effect. Additionally, different emotions tend to elicit enhanced activity in different brain areas.
Despite the slight differences between emotional and non-emotional studies, behavioral and brain activity studies do indicate an interaction in the processing of audio-visual stimuli. This interaction might be realized in different ways, and two models are often used to explain it. The first is the bottom-up model, in which it is assumed that sensory-specific stimuli are first processed separately in modality-specific brain regions and are subsequently integrated in higher-order multimodal brain areas through feed-forward projections. The other is the top-down model, which assumes that multisensory stimuli are processed in multimodal brain areas, which then project to sensory-specific unimodal areas. This model emphasizes the importance of feedback connections from multimodal to unimodal brain areas. Traditionally, the processing of multisensory stimuli has been thought of as a bottom-up process, driven by feed-forward projections. It has been extensively investigated in imaging and tracing studies in animals, as well as in humans using non-invasive methods. These studies of the neural processing of multimodal stimuli revealed brain areas responding to multiple modalities. These areas were found to receive input from modality-specific regions, which is in line with the bottom-up model (Macaluso, 2006). Imaging studies also found, aside from activation in sensory-specific brain areas, activation of multisensory regions by unimodal input independent of attentional awareness or selective attention to a single modality (Macaluso et al., 2002). This result is also in concordance with the bottom-up model; however, it does not exclude the top-down model.
Furthermore, a number of studies have found evidence more in line with the top-down model. For example, several EEG studies have found ERPs evoked by audio-visual stimuli as early as 40 ms after stimulus onset over parieto-occipital channels (Giard and Peronnet, 1999; Fort et al., 2002), with the activation of unimodal and multimodal brain areas found around the same time. These early ERPs occur too soon after stimulus onset to have been caused by feed-forward projections from unimodal brain areas (Foxe and Schroeder, 2005), which argues against a purely feed-forward mechanism. Further evidence supporting top-down modulation comes from imaging studies. Several studies have found an enhanced activation of sensory-specific cortices during the processing of multimodal stimuli compared to unimodal stimuli. If there were only feed-forward connections from unimodal to multimodal areas, this enhancement would not be found. Additionally, the connectivity between primary sensory areas (Falchier et al., 2002; Rockland and Ojima, 2003) also argues against purely feed-forward connections. In summary, both behavioral and brain activity studies have contributed to the knowledge pertaining to audio-visual integration. Behavioral studies display differences in the reaction to unimodal and multimodal stimuli, and both EEG and imaging studies give us better insight into the brain activity evoked by these stimuli. However, when using EEG and imaging techniques to gain more insight into the nature of the integration of audio-visual stimuli, the results can be conflicting. EEG and imaging studies have revealed evidence supporting both the bottom-up and the top-down model. It seems, however, that the bottom-up model is mostly supported by imaging studies, while evidence supporting the top-down model comes primarily from EEG studies and to a lesser degree from imaging studies. The results indicate that both bottom-up and top-down projections play a part in the integration of audio-visual stimuli.
Overall, research into the integration of emotional audio-visual stimuli still seems to be in an exploratory phase. Studies primarily focus on the activation elicited by audio-visual stimuli. And although a number of studies have gone further than simply studying which brain areas show activation in response to particular stimuli, and have investigated the connectivity between specific brain regions when participants are presented with non-emotional stimuli, similar studies using emotional audio-visual stimuli are scarce. These types of studies would increase our understanding of the nature of the processing of emotional and non-emotional audio-visual stimuli. In order to expand what is known about audio-visual processing, and specifically emotional audio-visual processing, both EEG and imaging techniques should be used. As said before, EEG has a high temporal and a low spatial resolution, whereas fMRI has a high spatial and a low temporal resolution. This results in complementary but separate information from the two techniques. In addition, it is also possible to use combined EEG-fMRI, which may reveal underlying generators of EEG activity that cannot be found by using EEG alone (Goldman et al., 2000; Lemieux et al., 2001). Combining the techniques offers the advantage of a high spatial and temporal resolution. However, it is still a fairly new approach and there seems to be no consensus yet on how to combine the techniques most effectively. Still, it may give us valuable insight into the processing of emotional audio-visual stimuli. As of yet very few studies have used concurrent EEG-fMRI to study bimodal integration, let alone emotional audio-visual integration.
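One common way of combining the two measurements is EEG-informed fMRI, in which a trial-by-trial EEG measure is convolved with a haemodynamic response function and used as a regressor for the BOLD signal. The sketch below illustrates this idea with entirely synthetic signals and a simple gamma-shaped HRF; it is not the procedure of any study cited above, and all names and parameters are assumptions.

```python
import numpy as np
from scipy.stats import gamma

def hrf(tr, duration=30.0):
    """Simple gamma-shaped haemodynamic response function sampled at the TR."""
    t = np.arange(0, duration, tr)
    return gamma.pdf(t, a=6)                         # peaks around 5 s

def eeg_informed_regressor(onsets, eeg_amplitudes, n_vols, tr):
    """Stick functions at stimulus onsets, scaled by single-trial EEG
    amplitudes and convolved with the HRF, sampled at the fMRI TR."""
    stick = np.zeros(n_vols)
    for onset, amp in zip(onsets, eeg_amplitudes):
        stick[int(round(onset / tr))] += amp
    return np.convolve(stick, hrf(tr))[:n_vols]

# Entirely synthetic example: 40 trials, 300 volumes, TR = 2 s.
rng = np.random.default_rng(5)
tr, n_vols = 2.0, 300
onsets = np.arange(10.0, 590.0, 14.5)[:40]           # trial onsets in seconds
eeg_amp = rng.normal(5, 1, 40)                        # e.g. single-trial N1 size
regressor = eeg_informed_regressor(onsets, eeg_amp, n_vols, tr)
bold = 0.8 * regressor + rng.normal(0, 1, n_vols)     # toy BOLD signal of one region

r = np.corrcoef(regressor, bold)[0, 1]
print("correlation between EEG-informed regressor and BOLD:", round(r, 2))
```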
All in all, emotional audio-visual stimuli modulate behavior and brain activity in several ways. First of all, they lead to a faster RT and a more accurate recognition of stimuli compared to unimodal stimulation. Furthermore, the processing of bimodal stimuli leads to a modulation of several ERP components and evokes enhanced brain activity in unimodal and multimodal brain areas. These results are, however, dependent on the type of emotion presented: some emotions affect ERP components and brain activity more strongly than others, and different emotions may cause enhanced activity in different brain areas. Still, even with these slightly different results in response to different emotions, it is clear that emotional audio-visual stimuli affect brain activity and behavior differently than unimodal stimuli. At this point in time it is unclear whether the integration of emotional audio-visual stimuli is predominantly driven by feed-forward or feedback projections. Both EEG and imaging studies reveal that purely feed-forward projections from unisensory to multisensory areas are not enough to explain the activity found in uni- and multisensory areas for both emotional and non-emotional stimuli, indicating at least a partial role for feedback connections. Considering the results, it is more logical to assume that multimodal stimuli are processed using a combination of feed-forward and feedback projections. In short, emotional audio-visual stimuli most definitely show interaction during the processing of sensory information, and this processing differs from that of unimodal stimuli and even of non-emotional bimodal stimuli. However, the underlying mechanisms are not quite clear yet.
Acknowledgements:
I would like to thank my supervisors and reviewers for their assistance in writing and for reviewing this thesis.
References
Calvert, G.A., Bullmore, E.T., Brammer, M.J., Campbell, R., Williams, S.C.R., McGuire, P.K., Woodruff, P.W.R.,
Iversen, S.D., David, A.S. (2002). Activation of auditory cortex during silent lipreading. Science. 276,
593-596.
Calvert, G.A., Campbell, R., Brammer, M.J. (2000). Evidence from functional magnetic resonance imaging of
crossmodal binding in the human heteromodal cortex. Current Biology. 10:11, 649-657.
Cappe, C., Barone, P. (2005). Heteromodal connections supporting multisensory integration at low levels of
cortical processing in the monkey. European Journal of Neuroscience. 22, 2886-2902.
Collignon, O., Girard, S., Gosselin, F., Roy, S., Saint-Amour, D., Lassonde, M., Lepore, F. (2008). Audio-visual
integration of emotional expression. Brain Research. 1242, 126-135.
Collignon, O., Girard, S., Gosselin, F., Roy, S., Saint-Amour, D., Lepore, F., Lassonde, M. (2010). Women process
multisensory emotion expressions more efficiently than men. Neuropsychologia. 48, 220-225.
Corballis, M.C. (1998). Interhemispheric neural summation in the absence of the corpus callosum. Brain. 121,
1795-1807.
Corballis, M.C. (2002). Hemispheric interaction in simple reaction time. Neuropsychologia. 40, 423-434.
Dolan, R.J., Morris, J.S., de Gelder, B. (2001). Crossmodal binding of fear in voice and face. PNAS. 98:17, 10006-10010.
Eckert, M.A., Kamdar, N.V., Chang, C.E., Beckmann, C.F., Greicius, M.D., Menon, V. (2008). A cross-modal
system linking primary auditory and visual cortices: Evidence from intrinsic fMRI connectivity analysis.
Human Brain Mapping. 29, 848-857.
Ekman, P., Friesen, W.V. (1969). The repertoire of nonverbal behavior – Categories, origins, usage and coding.
Semiotica. 1, 49-98.
Ekman, P., Friesen, W.V. (1971). Constants across culture in the face and emotion. Journal of Personality and
Social Psychology. 17:2, 124-129.
Ethofer, T., Anders, S., Erb, M., Droll, C., Royen, L., Saur, R., Reiterer, S., Grodd, W., Wildgruber, D. (2006).
Impact of voice on emotional judgment of faces: An event-related fMRI study. Human Brain Mapping.
27, 707-714.
Falchier, A., Clavagnier, S., Barone, P., Kennedy, H. (2002). Anatomical evidence of multimodal integration in
primate striate cortex.
Finney, E.M., Fine, I., Dobkins, K.R. (2001). Visual stimuli activate auditory cortex in the deaf. Nature
Neuroscience. 4:12, 1171-1173.
Fort, A., Delpuech, C., Pernier, J., Giard, M.H. (2002). Dynamics of cortico-subcortical cross-modal operations
involved in audio-visual object detection in humans. Cereb Cortex. 12:10, 1031-1039.
Foxe, J.J., Schroeder, C.E. (2005). The case for feedforward multisensory convergence during early cortical
processing. NeuroReport. 16:5, 419-423.
de Gelder, B., Böcker, K.B.E., Tuomainen, J., Hensen, M., Vroomen, J. (1999). The combined perception of
emotion from voice and face: early interaction revealed by human electric brain responses.
Neuroscience Letters. 260, 133-136.
de Gelder, B., Vroomen, J. (2000). The perception of emotion by ear and by eye. Cognition and Emotion. 14:3,
289-311.
de Gelder, B., Vroomen, J., Bertelson, P. (1998). Upright but not inverted faces modify the perception of
emotion in the voice. Current Psychology of Cognition. 17, 1021-1032.
Giard, M.H., Peronnet, F. (1999). Auditory-visual integration during multimodal object recognition in humans:
A behavioral and electrophysiological study. Journal of Cognitive Neuroscience. 11:5, 473-490.
Gielen, S.C.A.M., Schmidt, R.A., van den Heuvel, P.J.M. (1983). On the nature of intersensory facilitation of
reaction time. Perception and Psychophysics. 34:2, 161-168.
Goldman, R.I., Stern, J.M., Engel, J. Jr, Cohen, M.S. (2000). Acquiring simultaneous EEG and functional MRI.
Clinical Neurophysiology. 111:11, 1974-1980.
Gondan, M., Niederhaus, B., Rösler, F., Röder, B. (2005). Multisensory processing in the redundant-target
effect: A behavioral and event-related potential study. Perception and Psychophysics. 67:4, 713-726.
Grant, K.W. (2001). The effect of speechreading on masked detection threshold for filtered speech. The Journal
of the Acoustical Society of America. 109:5, 2272-2275.
Grant, K.W., Seitz, P.,-F. (2000). The use of visible speech cues for improving auditory detection of spoken
sentences. The Journal of the Acoustical Society of America. 108:3, 1197-1208.
Grossman, T. (2010). The development of emotional perception in face and voice during infancy. Restorative
Neurology and Neuroscience. 28, 219-236.
Jones, J.A., Callan, D.E. (2003). Brain activity during audiovisual speech perception: An fMRI study of the
McGurk effect. NeuroReport. 14:8, 1129-1133.
Kreifelts, B., Ethofer, T., Grodd, W., Erb, M., Wildgruber, D. (2007). Audiovisual integration of emotional signals
in the voice and face: An event-related fMRI study. NeuroImage. 37, 1445-1456.
Lemieux, L., Salek-Haddadi, A., Josephs, O., Allen, P., Toms, N., Scott, C., Krakow, K., Turner, R., Fish, D.R.
(2001). Event-related fMRI with simultaneous and continuous EEG: Description of the method and
initial case support. NeuroImage. 14:3, 780-787.
Macaluso, E. (2006). Multisensory processing in sensory-specific cortical areas. Neuroscientist. 12:4, 327-338.
Macaluso, E., Frith, C.D., Driver, J. (2002). Directing attention to locations and to sensory modalities: multiple
levels of selective processing revealed with PET. Cereb Cortex. 12, 357–68.
Magnée, M.J.C.M., de Gelder, B., van Engeland, H., Kemner, C. (2008). Atypical processing of fearful face-voice
pairs in Pervasive Development Disorder: An ERP study. Clinical Neurophysiology. 119, 2004-2010.
Magnée, M.J.C.M., de Gelder, B., van Engeland, H., Kemner, C. (2011). Multisensory integration and attention
in Autism Spectrum Disorder: evidence from Event-Related Potentials. PLoS ONE 6:8, e24196.
Marzi, C.A., Smania, N., Martini, M.C., Gambina, G., Tomelleri, G., Palamara, A., Alessandrini, F., Prior, M.
(1996). Implicit redundant-target effect in visual extinction. Neuropsychologia. 34:1, 9-22.
Massaro, D.W., Egan, P.B. (1996). Perceiving affect from the voice and the face. Psychonomic Bulletin and
Review. 3:2, 215-221.
McGurk H., MacDonald J. (1976). Hearing lips and seeing voices. Nature. 264, 746–748.
Miller, J. (1982). Divided attention: Evidence for coactivation with redundant signals. Cognitive Psychology. 14,
247-279.
Miller, J. (1991). Channel interaction and the redundant-target effects in bimodal divided attention. Journal of
Experimental Psychology: Human Perception and Performance. 17:1, 160-169.
Molholm, S., Ritter, W., Murray, M.M., Javitt, D.C., Foxe, J.J. (2004). Multisensory visual-auditory object
recognition in human: A high-density electrical mapping study. Cerebral Cortex. 14, 452-465.
Molholm, S., Ritter, W., Murray, M.M., Javitt, D.C., Schroeder, C.E., Foxe, J.J. (2002). Multisensory auditory-visual interactions during early sensory processing in humans: a high-density electrical mapping study.
Cognitive Brain Research. 14, 115-128.
Molholm, S., Sehatpour, P., Mehta, A.D., Shpaner, M., Gomez-Ramirez, M., Ortigue, S., Dyke, J.P., Schwartz,
T.H., Foxe, J.J. (2006). Audio-visual multisensory integration in superior parietal lobule revealed by
human intracranial recordings. Journal of Neurophysiology. 96, 721-729.
Murray, M.M., Foxe, J.J., Higgins, B.A., Javitt, D.C., Schroeder, C.E. (2001). Visuo-spatial neural response
interactions in early cortical processing during a simple reaction time task: A high density electrical
mapping study. Neuropsychologia. 39, 828-844.
Olsen, I.R., Plotzker, A., Ezzyat, Y. (2007). The enigmatic temporal pole: A review of findings on social and
emotional processing. Brain. 130, 1718-1731.
Paulmann, S., Pell, M.D. (2010). Contextual influences of emotional speech prosody on emotional face
processing: How much is enough? Cognitive, Affective and Behavioral Neuroscience. 10:2, 230-242.
Pazo-Alvarez, P., Cadaveira, F., Amenedo, E. (2003). MMN in the visual modality: A review. Biological
Psychology. 63, 199-236.
Pell, M.D., Monetta, L., Paulmann, S., Kotz, S.A. (2009). Recognizing emotions in a foreign language. Journal of
Nonverbal Behavior. 33, 107-120.
Pourtois, G., Debatisse, D., Despland, P.-A., de Gelder, B. (2002). Facial expressions modulate the time course of
long latency auditory brain potentials. Cognitive Brain Research. 14, 99-105.
Pourtois, G., de Gelder, B., Bol, A., Crommelinck, M. (2005). Perception of facial expressions and voices and of
their combination in the human brain. Cortex. 41, 49-59.
Pourtois, G., de Gelder, B., Vroomen, J., Rossion, B., Crommelinck, M. (2000). The time-course of intermodal
binding between seeing and hearing affective information. NeuroReport. 11:6, 1329-1333.
Robins, D.L., Hunyadi, E., Schultz, R.T. (2009). Superior temporal activation in response to dynamic audio-visual
emotional cues. Brain and Cognition. 69, 269-278.
Rockland, K.S., Ojima, H. (2003). Multisensory convergence in calcarine visual areas in macaque monkey.
International Journal of Psychophysiology. 50, 19-26.
Sadato, N., Pascual-Leone, A., Grafman, J., Ibañez, V., Deiber, M.-P., Dold, G., Hallet, M. (1996). Activation of
the primary visual cortex by Braille reading in blind subjects. Nature. 380, 526-528.
Scherer, K.R., Banse, R., Wallbott, H.G. (2001). Emotion inferences from vocal expression correlate across
languages and cultures. Journal of Cross-Cultural Psychology. 32:1, 76-92.
Scheef, L., Boecker, H., Daamen, M., Fehse, U., Landsberg, M.W., Granath, D.-O., Mechling, H., Effenberg, A.O.
(2009). Multimodal motion processing in area V5/MT: Evidence from an artificial class of audio-visual
events. Brain Research. 1252, 94-104.
Schwartz, J.-L., Berthommier, F., Savariaux, C. (2004). Seeing to hear better: Evidence for early audio-visual
interactions in speech identification. Cognition. 93:2, B69-B78.
Schröger, E., Widmann, A. (1998). Speeded responses to audiovisual changes result from bimodal integration.
Psychophysiology. 35, 755-759
Shams, L., Kamitani, Y., Thompson, S., Shimojo, S. (2001). Sound alters visual evoked potentials in humans.
NeuroReport. 12:17, 3849-3852.
de Silva, L.C., Miyasato, T., Nakatsu, R. (1997). Facial emotion recognition using Multi-modal information.
Communications and Signal processing. 1, 397-401.
Spence, C., Driver, J. (2000). Attracting attention to the illusory location of a sound: reflexive crossmodal
orienting and ventriloquism. NeuroReport. 11:9, 2057-2061.
Teder-Sälejärvi, W.A., Di Russo, F., McDonald, J.J., Hillyard, S.A. (2005). Effects of spatial congruity on audio-visual multimodal integration. Journal of Cognitive Neuroscience. 17:9, 1396-1409.
Teder-Sälejärvi, W.A., McDonald, J.J., Di Russo, F., Hillyard, S.A. (2002). An analysis of audio-visual crossmodal integration by means of event-related potential (ERP) recordings. Cognitive Brain Research. 14, 106-114.
Thompson, W.F., Balkwill, L.-L. (2006). Decoding speech prosody in five languages. Semiotica. 158:1/4, 407-424.
Vroomen, J., Collier, R., Mozziconacci, S. (1993). Duration and intonation in emotional speech. Proceedings of
the Third European Conference on Speech Communication and Technology, Berlin. 577-580. Berlin:
ESCA.
Vroomen, J., Bertelson, P., de Gelder, B. (2001a). Directing spatial attention towards the illusory location of a
ventriloquized sound. Acta Psychologica. 108, 21-33.
Vroomen, J., Driver, J., de Gelder, B. (2001b). Is cross-modal integration of emotional expressions independent
of attentional resources? Cognitive, Affective, and Behavioral Neuroscience. 1:4, 382-387.
Zhao, L., Li, J. (2006). Visual mismatch negativity elicited by facial expressions under non-attentional conditions.
Neuroscience Letters. 410, 126-131.