The influence of perceptual categories on auditory
feedback control in speech
ORAL EXAM REPORT
Carrie Niziolek
Speech and Hearing Bioscience and Technology Program
Harvard-MIT Division of Health Sciences and Technology
TABLE OF CONTENTS
I. Introduction
II. Specific Aims
III. Background and Significance
IV. Research Design and Methods
V. References
I. INTRODUCTION
Phonemes are the theoretical representations of sound units in human language.
When we learn a first language, our auditory experience shapes how we will segment the
acoustic space into phonetic units — where we draw the boundaries that differentiate speech
sounds. As we continue to master a native language, phonetic distinctions that are not
present in the auditory input are lost in favor of a more robust coding of the speech sounds
that do exist in the language.[1-3] These phonetic boundaries are thus a product of our
auditory experience.
When we learn to speak, we are learning a mapping between the motor commands
for speech gestures and the sounds these gestures produce. In other words, we learn targets
(regions in a planning space) that correspond to phonemes in the language, and motor
commands that carry out these targets to produce a particular speech sound. [] Our own
voices act as auditory feedback, enabling a precise tuning, over time, of the motor-acoustic
mapping. Adults with years of speech experience have well-tuned neural mappings. When
there is a disconnect between the expected and observed acoustic consequences of an
articulatory gesture, feedback control allows for detection and consequent correction of the
perceived error. By influencing subjects’ perception of their own speech, we can induce
such a discrepancy and observe the neural consequences as well as any compensatory
movements of the speech articulators.
This research concerns the influence of learned phonetic categories on auditory
feedback control. Does the neural response to a perceived speech error depend on whether
that error crosses a phonetic boundary, causing the produced sound to be heard as a
different phoneme?
II. SPECIFIC AIMS
Phonetic boundaries divide speech sounds into discrete chunks within a language.
These boundaries function as discontinuities along a perceptual continuum; that is, a
continuous acoustic space is warped at the boundaries to yield a perceptual representation
that is non-continuous and categorical. Since languages place phonetic boundaries in
different places, these differential mappings must develop as a consequence of language
experience. In this proposal, I will explore quantitatively the ways in which
psychophysical judgments change as a function of learned acoustic input. My specific aim is
to differentiate the neural signatures of within-category and across-category errors in
auditory feedback.
In this experiment, the first and second formant frequencies of subjects’ speech are
perturbed in real time and fed back to subjects through headphones. By changing the filter
characteristics in this way, we change the character of the vowel. We thus create a sudden,
unexpected mismatch between the vowel target and the perceived realization. A subject who
says “bet” might hear himself instead saying “bit” or “bat”, for example. By altering the
speech feedback signal before it reaches the ear, we can induce perceived errors and observe
the consequences. Using fMRI to image the brain during this task, we can attempt to
distinguish between two types of error-correction cells in auditory cortex: one type that
responds to any auditory difference between intended and perceived sounds (within-category
shift), and another that is specific to a phonetic difference (between-category shift).
In addition to inducing categorical perception, measured by psychophysical
performance, I will examine how neural activation changes after prolonged exposure to the
novel stimuli. Using fMRI metrics, I will measure brain activation during the passive
perception of the training stimuli. Is it possible to change auditory cortical activation
patterns with short-term acoustic training? How does categorical learning change the
distribution of the firing preferences of neurons in auditory cortical maps?
In this way, these studies will address the nature of phonetic categories in the brain,
how they are formed, and how they can be manipulated through training.
III. BACKGROUND AND SIGNIFICANCE
Introducing gradual shifts in formant structure [#] or pitch [#] causes subjects to
gradually adapt to the perturbation. That is, they produce speech whose formants or pitch
exhibit an opposing shift to counteract some of the perturbation. Recent
experiments [#] have shown subjects’ ability to compensate for rapidly introduced
perturbations, both acoustic and motor. These paradigms have only a small percentage of
perturbed trials, ensuring that the perturbation is unexpected and allowing subjects to “reset”
in between trials. This resetting prevents subjects from adapting over time and enables a
simple comparison of shifted and unshifted trials. fMRI experiments have identified a
region of temporal cortex containing “auditory error cells,” neurons that detect the
disconnect between expected and actual auditory feedback of speech. Tourville [#]
found auditory error cells in the planum temporale and superior temporal gyrus (STG).
Phonetic categories are discrete perceptual representations of the sounds of a
language. As distinct entities, the categories are an abstract linguistic concept, but there are
neural correlates that support their existence in the brain. Imaging studies have shown more
neural activity in response to “poor” phonemes, ambiguous sounds that lie near phonetic
category boundaries, than to “good” phonemes, prototypical sounds that lie squarely in the
center of the phonetic category.[4,5] This differential activation is reflected behaviorally: it is
harder for human subjects to distinguish between pairs of sounds that lie near category
centers than pairs of sounds that straddle category boundaries, even when the pairs are
separated by the same distance along some acoustic dimension.[6,7] This measurable
psychophysical effect is a hallmark of categorical perception.
Categorical perception is not specific to the auditory system; it is a phenomenon
present in many modalities. Phonetic category perception is interesting because it is
particularly sharp and robust within an individual while differing across individuals with
different auditory experience. It is usually assessed using stimuli that vary in a single acoustic
attribute along a continuum. For example, we can construct a continuum from [t] to [d] by
varying the voice onset time, or VOT, from 0-40 ms, evenly spaced in time. Subjects are
asked to identify randomly-presented stimuli from the continuum as either [t] or [d]. Instead
of perceiving a gradual adjustment from [t] to [d], subjects will report an abrupt change in
phoneme perception once a boundary is crossed: in this case, around 20 ms[7]. In addition,
reaction time will be longest at this boundary.[8] Finally, as mentioned above, discrimination
is best when the two stimuli come from opposite sides of the boundary.
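This identification behavior is well captured by a logistic (sigmoid) psychometric function. The following minimal Python sketch computes an idealized identification curve for the [d]-[t] continuum; the 20 ms boundary matches the value cited above, but the slope is an illustrative assumption rather than a fitted value.

```python
import numpy as np

def p_t_response(vot_ms, boundary_ms=20.0, slope=0.5):
    """Idealized psychometric function: probability of a /t/ response as a
    function of voice onset time. boundary_ms follows the ~20 ms value
    cited above; the slope is an illustrative assumption, not fitted data."""
    return 1.0 / (1.0 + np.exp(-slope * (vot_ms - boundary_ms)))

# An evenly spaced [d]-[t] continuum from 0 to 40 ms VOT.
for vot in np.linspace(0, 40, 9):
    print(f"VOT = {vot:4.1f} ms -> P(/t/ response) = {p_t_response(vot):.2f}")
```

The steepness of the curve at the boundary is what the identification and reaction-time measures quantify: nearly all of the perceptual change is packed into a narrow region around 20 ms.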
This psychophysical evidence demonstrates the sudden perceptual discontinuities
imposed by category boundaries: a single sound is heard as either a [d] or a [t], not
somewhere in the middle. Phonemes are discrete and categorical: we give names to them,
and they allow us to discriminate words with different meanings. By processing acoustic
input into phonemes, we ignore small variations that have no phonetic consequence, while
paying close attention to the boundary regions, where small changes matter for phonetic
identity.
A robust representation of these sound categories in the brain allows us to rapidly
process and understand incoming speech, and to compare our own speech productions to
internal auditory schemata. The finding of increased brain activation to boundary stimuli
implies that more neural resources are devoted to processing these ambiguous sounds.[4,5]
The formation of phonetic categories is an example of perceptual warping of auditory space
that is contingent upon acoustic exposure.
The participants in this study were mostly unaware of any auditory mischief between
microphone and headphones.
This project is novel in several respects. The resulting data would be extremely
valuable in localizing the neural sound maps that correspond to speech; in other words, they
would help us more fully understand how speech sounds are represented in the auditory cortex. These
experiments will allow for improvements of the DIVA neural network model of speech
acquisition,[12] which is used to posit neural computations that underlie speech processing,
and to model various speech pathologies.
This work also has important consequences for feedback-based training.
IV. RESEARCH DESIGN AND METHODS
In order to explore the role of phonetic categories in feedback control, we will
examine the neural response to a sudden disruption in the auditory feedback loop, as elicited
by an unexpected acoustic shift in real-time. In this experiment, sudden auditory
perturbations will occur during subjects’ speech in an attempt to elicit a mismatch between
the auditory speech target and the actual realization. We hypothesize that this will induce
activity in the auditory error cells that detect this disconnect. Furthermore, we will attempt to
distinguish between two populations of error cells: auditory error cells, which become
activated during an auditory difference between intended and perceived sounds, and
phonetic error cells, hypothesized to be active only when this auditory difference is large
enough to cross a phonetic category boundary. That is, the phonetic error cells detect a
mismatch between the target phoneme and the perceived phoneme. We will therefore
construct opposing shifts in formant space whose magnitudes are identical within a subject
but whose relations to the category boundary differ: one shift remains within the original
vowel category, while the other crosses into a neighboring category.
Subjects. Subjects will consist of right-handed men and women, ages 18-35, whose
first language is American English, and who have normal hearing and speech, as well as no
metal in the body (required for imaging eligibility). Subjects whose vowel boundaries are
determined to be asymmetric around a center vowel will be preferred, as this enables more
assurance that shifts of equal magnitude in opposite directions will fall within and across a
category boundary, respectively.
Experimental design: overview
The experiment can be broken into three phases: an initial speech production test, a
speech perception test, and an fMRI session.
Test 1: production. For each subject, vowel production data were collected in the carrier
consonants [b_d], with ten productions for each of the six vowels [i, ɪ, ɛ, æ, ɑ, u]. Median
formant values were determined for each vowel.
Test 2: perception. Using a real-time formant shifting algorithm (developed by Marc
Boucek, Speech Communication Group), eight vowel continua were generated across the
F1-F2 spectrum. The production token closest to the median for each vowel was shifted in
formant space in ten successive increments towards its neighboring vowels. Each
continuum began at the median formant values of one vowel and ended at the median
formant values of a neighboring vowel, with one additional token added at each end. The
step size between each token in the continuum was constant on the mel scale (a
perceptually derived logarithmic scale, where 1000 mels = 1000 Hz). Furthermore, two
continua were generated for each vowel pair: one starting from each end.
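The step-size arithmetic behind these continua can be sketched in a few lines of Python. This is not the real-time shifting algorithm itself (Boucek’s implementation is not reproduced here), only the mel-spaced interpolation between two median formant values; the conversion formula is the standard one calibrated so that 1000 Hz corresponds to 1000 mels, and the example F1 endpoints are illustrative rather than measured subject medians.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Standard mel conversion, calibrated so 1000 Hz maps to ~1000 mels."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def formant_continuum(f_start_hz, f_end_hz, n_steps=10, n_extra=1):
    """Tokens spaced evenly in mels from one vowel's median formant to a
    neighbor's, with n_extra tokens extrapolated past each endpoint."""
    m0, m1 = hz_to_mel(f_start_hz), hz_to_mel(f_end_hz)
    step = (m1 - m0) / n_steps
    mels = m0 + step * np.arange(-n_extra, n_steps + n_extra + 1)
    return mel_to_hz(mels)

# Illustrative F1 endpoints for an [ɛ] -> [æ] continuum (not real medians).
print(np.round(formant_continuum(550.0, 700.0), 1))
```

Because the spacing is constant in mels rather than in hertz, successive steps are intended to be roughly equal in perceptual size across the continuum.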
The randomized tokens were presented four times each to the subjects, who were
instructed to categorize what they heard as one of six possible words: bead, bid, bed, bad,
bod, or booed. The categorization data were fitted to sigmoid curves to determine an
approximate perceptual boundary between the vowels at the continuum endpoints.
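A minimal sketch of this boundary estimation, assuming a standard logistic fit with SciPy; the response proportions below are invented for illustration and do not reproduce any subject’s data.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(step, boundary, slope):
    """Logistic psychometric function over continuum step index."""
    return 1.0 / (1.0 + np.exp(-slope * (step - boundary)))

# Hypothetical identification data: proportion of endpoint-vowel responses
# (4 presentations per token) at each of the 13 continuum steps.
steps = np.arange(13)
p_end = np.array([0.0, 0.0, 0.0, 0.0, 0.25, 0.25, 0.5,
                  0.75, 1.0, 1.0, 1.0, 1.0, 1.0])

(boundary, slope), _ = curve_fit(sigmoid, steps, p_end, p0=[6.0, 1.0])
print(f"estimated boundary: step {boundary:.2f}, slope {slope:.2f}")
```

The fitted boundary parameter gives the continuum step at which the subject is equally likely to report either vowel, which is the quantity used to define within- versus across-category shifts in the imaging session.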
Test 3: Brain imaging. Using fMRI, I will measure the BOLD response during speech,
both with and without perturbation, as well as during a baseline condition. The experiment
will be event-related, using a triggering mechanism to coordinate stimulus timing with image
acquisitions.
Subjects were scanned in a 3T Siemens Tim Trio whole-body MRI machine, located
at the Athinoula A. Martinos Imaging Center at the McGovern Institute for Brain Research,
MIT. The subjects’ speech was recorded by a custom-made MR-safe microphone, and
auditory feedback was delivered via insert headphones (Stax SRS-005II electrostatic
headphones). Subjects wore supra-aural ear seals surrounded by a custom-made foam helmet,
affectionately nicknamed the Head Cozy, to insulate them from the noise of the scanner.
On each trial, subjects were presented with a word (e.g. “bed”) or a control stimulus
(“***”). These visual stimuli were projected in high-contrast white-on-black and displayed
on a rear projection screen, visible to the subjects through a mirror mounted above the MRI
head coil. Subjects were instructed to read the word aloud and to remain silent on the
control trials.
Unbeknownst to the subjects, the speech trials were divided into three conditions:
On one out of every four test trials, the formants were perturbed either across or
within the vowel category of the vocalized word before being fed back to the subjects. This
elicits the percept of having mispronounced the trial’s word; the auditory output the subjects
expect to hear does not correspond with the artificially shifted output of the headphones.
In summary, the four conditions experienced by the subject were:
1. baseline: a control condition in which the subject remained silent.
2. no-shift: feedback was returned to the subject unaltered.
3. shift-within: a within-category shift was applied to the subjects’ speech.
4. shift-across: a cross-category shift of the same magnitude as that of
condition 3.
After a 2-second delay from stimulus offset, the stimulus presentation software
triggered the scanner to collect two volumes of functional data. This trigger is followed by a
pause of N seconds before the next trial to allow for the partial return of the BOLD
signal to the steady state. Because the image acquisition is timed to occur several seconds
after the stimulus onset, subjects speak in relative silence. The acquisition parameters were
typical of those used in previous speech experiments (echo planar imaging; 30 slices covering
the entire cortex and cerebellum, aligned to the AC-PC line; 5 mm slice thickness; 0 mm gap
between slices; flip angle = 90°). [Figure: Timeline for each trial.]
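To make the timing concrete, the sketch below lays out one trial of this event-related, sparse-sampling design. The presentation and trigger calls are hypothetical stand-ins (the proposal does not specify the software interface), the stimulus display duration is an assumed value, and the intertrial pause is a placeholder for the unspecified N seconds above.

```python
import time

# Hypothetical hooks; the actual presentation software and scanner trigger
# interface are not specified in the proposal.
def present_stimulus(text):
    print(f"displaying: {text}")

def trigger_scanner(n_volumes):
    print(f"acquiring {n_volumes} functional volumes")

STIM_DURATION_S = 2.0      # assumed display time (not given in the source)
POST_OFFSET_DELAY_S = 2.0  # from the protocol: 2 s after stimulus offset
INTERTRIAL_PAUSE_S = 10.0  # placeholder for the unspecified "N" seconds

def run_trial(word):
    """One trial: the subject speaks in relative scanner silence, then two
    volumes are acquired once the BOLD response has had time to develop."""
    present_stimulus(word)                             # subject reads aloud
    time.sleep(STIM_DURATION_S + POST_OFFSET_DELAY_S)  # delay from offset
    trigger_scanner(n_volumes=2)                       # sparse acquisition
    time.sleep(INTERTRIAL_PAUSE_S)                     # BOLD returns toward baseline

run_trial("bed")
```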
In addition to the functional data, anatomical volumes were collected in order to
overlay each subject’s functional data on his or her own brain. Diffusion tensor imaging was
also performed to trace white matter tracts as they travel between connected regions of the
cortex. These data will be used for functional connectivity analyses between brain regions
implicated in the task.
Data analysis
Acoustic data will be analyzed for median formant values, which will be compared across
the no-shift, shift-within, and shift-across conditions.
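As a shape-of-the-analysis illustration, the comparison might look like the following; the values and column names in the data frame are invented, not pilot measurements.

```python
import pandas as pd

# Hypothetical per-trial produced-formant measurements (Hz).
trials = pd.DataFrame({
    "condition": ["no-shift", "no-shift", "shift-within", "shift-within",
                  "shift-across", "shift-across"],
    "f1_hz": [580, 575, 560, 555, 600, 610],
    "f2_hz": [1800, 1790, 1820, 1830, 1760, 1750],
})

# Median produced formants per condition; compensation would appear as a
# median shifted opposite to the direction of the applied perturbation.
print(trials.groupby("condition")[["f1_hz", "f2_hz"]].median())
```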
The functional data will be analyzed using software packages including SPM and
FreeSurfer. Both voxel-based and surface-based analyses of activation will be carried out.
The pre-training activation patterns (passive listening minus resting state) will be compared
with the post-training activation patterns (passive listening minus resting state) in terms of
location and extent of activation. The behavioral data will also be compared before and after
training to look for differences in subjects’ ability to categorize and discriminate stimuli
drawn from the non-speech continuum.
Expected results
Psychophysical results show that in the majority of cases, subjects exhibited
categorical perception, as evidenced by the steep slope of the sigmoid fit to the data (Figure
#). Furthermore, the perceptual boundary within a given subject differed based on the
starting point of the continua (that is, the vowel token from which the continua were
generated). Even though the tokens were presented randomly, the percept from the original
vowel tended to dominate each continuum.
Secondly, there is a great deal of variability in the location of vowel boundaries
across subjects for a given continuum (Figure #). That is, it takes a larger step size to elicit
the perception of phonetic change for some subjects than for others.
Preliminary fMRI results are still being analyzed.
Interpretation. My hypothesis is that the learning of sound categories changes the
distribution of the firing preferences of neurons in auditory cortical maps, and thus
changes the discriminability of sounds from different parts of acoustic space. I will
interpret a change in cortical activation, in the presence of no change for control stimuli, as
evidence that cortical reorganization has occurred in response to the acoustic training. In
particular, an increase in activation implies more neural resources devoted to processing a
sound near the category boundary, while a decrease in activation implies a concomitant
lessening of resources. This reallocation of neural resources for prototypical stimuli would
be consistent with past imaging work.[4] Broadly, success in this protocol would be
interpreted as new evidence for the plasticity of adult phonetic representations in cortex, and
a promising start to routines that might exploit this ability in second-language learning.
Alternative approaches. One potential limitation of this paradigm is the inability to
directly compare identical shifts: within a subject, one shift is always in the opposite
direction from the other. We have controlled for this by counterbalancing subjects for whom
the category-crossing shift is in the upward direction and subjects for whom it is in the
downward direction. However, an alternative experimental design contrasts shifts of the same
magnitude and direction across subjects. It is possible to create chains of subjects whose
category boundaries differ, such that the same shift crosses a boundary for one subject but
not for the next. This design was abandoned when it proved too difficult to construct
subject chains and too risky to rely on single subjects’ data to avoid breaking the chain.
(During pilot testing, one subject
dropped out of the study mid-scan after his “pair” had already been run, rendering both data
sets useless.)
Potential limitations. The success of these experiments is dependent on the accuracy of
the psychophysical test that determines category boundaries. Additionally, categories may be
unstable across time. Refining the nature of these stimuli will take up a large portion of my
research time. In addition, these experiments do bank on successful cortical reorganization
to show a positive result. The null result, in which brain activation cannot be
distinguished between the within- and across-category conditions, would still be an
interesting finding, however; this work would help to characterize the cortical response
to different types of feedback perturbation and would provide the DIVA model with
potential parameters.
V. REFERENCES
1. Werker and Tees, 1983. Developmental changes across childhood in the perception
of non-native speech sounds. Can J Psychol, 37(2), 278-286.
2. Tees and Werker, 1984. Perceptual flexibility: maintenance or recovery of the ability
to discriminate non-native speech. Can J Psychol, 38(4), 579-590.
3. Kuhl et al., 1992. Linguistic experience alters phonetic perception in infants by 6
months of age. Science, 255, 606-608.
4. Guenther et al., 2004. Representation of sound categories in auditory cortical
maps. Journal of Speech, Language, and Hearing Research, 47(1), 46-57.
5. Guenther and Gjaja, 1996. The perceptual magnet effect as an emergent property of
neural map formation. J Acoust Soc Am, 100, 1111-1121.
6. Harnad, 1986. Psychophysical and cognitive aspects of categorical perception: A
critical overview. In S. Harnad (Ed.), Categorical Perception: The Groundwork
of Cognition. New York: Cambridge University Press, 1-52.
7. Blumstein et al., 2005. The perception of voice-onset time: An fMRI investigation of
phonetic category structure. J Cog Neurosci, 17(9), 1353-1366.
8. Boucher, 2002. Timing relations in speech and the identification of voice-onset
times: A stable perceptual boundary for voicing categories across speaking rates.
Percept Psychophys, 64, 121-130.
9. Miller et al., 1977. Discrimination and labeling of noise-buzz sequences with varying
noise-lead times: An example of categorical perception. J Acoust Soc Am, 60(2),
410-417.
10. Pisoni, 1977. Identification and discrimination of the relative onset time of two
component tones: Implications for voicing perception in stops. J Acoust Soc Am,
61(5), 1352-1361.
11. Eimas et al., 1971. On infant speech perception and the acquisition of language. In
S. Harnad (Ed.), Categorical Perception: The Groundwork of Cognition. New
York: Cambridge University Press.
12. Guenther et al., 2006. Neural modeling and imaging of the cortical interactions
underlying syllable production. Brain and Language, 96(3), 280-301.