The influence of perceptual categories on auditory feedback control in speech

ORAL EXAM REPORT

Carrie Niziolek
Speech and Hearing Bioscience and Technology Program
Harvard-MIT Division of Health Sciences and Technology

TABLE OF CONTENTS

I. Introduction
II. Specific Aims
III. Background and Significance
IV. Research Design and Methods
V. References

I. INTRODUCTION

Phonemes are the theoretical representations of sound units in human language. When we learn a first language, our auditory experience shapes how we segment the acoustic space into phonetic units: where we draw the boundaries that differentiate speech sounds. As we continue to master a native language, phonetic distinctions that are not present in the auditory input are lost in favor of a more robust coding of the speech sounds that do exist in the language.[1-3]

When we learn to speak, we are learning a mapping between the motor commands for speech gestures and the sounds these gestures produce. In other words, we learn targets (regions in a planning space) that correspond to phonemes in the language, and motor commands that carry out these targets to produce a particular speech sound. Our own voices act as auditory feedback, enabling a precise tuning, over time, of this motor-acoustic mapping. Adults with years of speech experience have well-tuned neural mappings. When there is a disconnect between the expected and observed acoustic consequences of an articulatory gesture, feedback control allows for detection and consequent correction of the perceived error. By influencing subjects' perception of their own speech, we can induce such a discrepancy and observe the neural consequences as well as any compensatory movements of the speech articulators.

This research concerns the influence of learned phonetic categories on auditory feedback control. Does the neural response to a feedback error depend on whether the error crosses a phonetic boundary, such that speakers hear themselves producing a different phoneme?

II. SPECIFIC AIMS

Phonetic boundaries divide speech sounds into discrete chunks within a language. These boundaries function as discontinuities along a perceptual continuum; that is, a continuous acoustic space is warped at the boundaries to yield a perceptual representation that is non-continuous and categorical. Since languages place phonetic boundaries in different places, these differential mappings must develop as a consequence of language experience. In this proposal, I will explore quantitatively the ways in which psychophysical judgments change as a function of learned acoustic input. My specific aim is to differentiate the neural signals evoked by feedback errors that do and do not cross a phonetic boundary.

In this experiment, the first and second formant frequencies of subjects' speech are perturbed in real time and fed back to subjects through headphones. By changing the filter characteristics in this way, we change the character of the vowel, creating a sudden, unexpected mismatch between the vowel target and the perceived realization. A subject who says "bet" might hear himself instead saying "bit" or "bat", for example. By altering the speech feedback signal before it reaches the ear, we can induce perceived errors and observe the consequences.
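To make the perturbation concrete, here is a minimal sketch of a fixed-magnitude formant shift, assuming (as in the Methods below) that shifts are specified on the mel scale; the mel formula is one standard variant, and the formant value and shift size are hypothetical, for illustration only.

```python
import math

def hz_to_mel(f_hz):
    # O'Shaughnessy's mel formula; on this scale, 1000 Hz maps to ~1000 mels.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

def shift_formant(f_hz, delta_mel):
    """Shift a formant frequency by a fixed distance in mel space."""
    return mel_to_hz(hz_to_mel(f_hz) + delta_mel)

# Hypothetical median F1 of the vowel in "bet" for one talker.
f1_bet = 580.0
print(round(shift_formant(f1_bet, -150.0)))  # lower F1: toward the vowel in "bit"
print(round(shift_formant(f1_bet, +150.0)))  # higher F1: toward the vowel in "bat"
```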
Using fMRI to image the brain during this task, we can attempt to distinguish between two types of error cells in auditory cortex: one type that responds to any auditory difference between intended and perceived sounds (within-category shift), and another that is specific to a phonetic difference (between-category shift).

In addition to inducing categorical perception, as measured by psychophysical performance, I will examine how neural activation changes after prolonged exposure to the novel stimuli. Using fMRI metrics, I will measure brain activation during the passive perception of the training stimuli. Is it possible to change auditory cortical activation patterns with short-term acoustic training? How does categorical learning change the distribution of the firing preferences of neurons in auditory cortical maps? In this way, these studies will address the nature of phonetic categories in the brain, how they are formed, and how they can be manipulated through training.

III. BACKGROUND AND SIGNIFICANCE

Introducing gradual shifts in formant structure [#] or pitch [#] causes subjects to gradually adapt to the perturbation. That is, they produce speech whose formants or pitch exhibit an opposing shift that counteracts some of the perturbation. Recent experiments [#] have shown subjects' ability to compensate for rapidly introduced perturbations, both acoustic and motor. These paradigms perturb only a small percentage of trials, ensuring that the perturbation is unexpected and allowing subjects to "reset" in between trials. This resetting prevents subjects from adapting over time and enables a simple comparison of shifted and unshifted trials. fMRI experiments have identified a region of temporal cortex containing "auditory error cells," neurons that detect the disconnect between expected and actual auditory feedback of speech. Tourville [#] found auditory error cells in the planum temporale and superior temporal gyrus.

Phonetic categories are discrete perceptual representations of the sounds of a language. As distinct entities, the categories are an abstract linguistic concept, but there are neural correlates that support their existence in the brain. More neural activity has been observed in response to "poor" phonemes, ambiguous sounds that lie near phonetic category boundaries, than to "good" phonemes, prototypical sounds that lie squarely in the center of the phonetic category.[4,5] This differential activation is reflected behaviorally: it is harder for human subjects to distinguish between pairs of sounds that lie near category centers than pairs of sounds that straddle category boundaries, even when the pairs are separated by the same distance along some acoustic dimension.[6,7] This measurable psychophysical effect is a hallmark of categorical perception.

Categorical perception is not specific to the auditory system; it is a phenomenon present in many modalities. Phonetic category perception is interesting because it is particularly sharp and robust within an individual while differing across individuals with different auditory experience. It is usually assessed using stimuli that vary in a single acoustic attribute along a continuum. For example, we can construct a continuum from [t] to [d] by varying the voice onset time, or VOT, from 0 to 40 ms in evenly spaced steps. Subjects are asked to identify randomly presented stimuli from the continuum as either [t] or [d].
Instead of perceiving a gradual adjustment from [t] to [d], subjects will report an abrupt change in phoneme perception once a boundary is crossed: in this case, around 20 ms.[7] In addition, reaction time will be highest at this boundary.[8] Finally, as mentioned above, discrimination is best when the two stimuli come from opposite sides of the boundary. This psychophysical evidence demonstrates the sudden perceptual discontinuities imposed by category boundaries: a single sound is heard as either a [d] or a [t], not somewhere in the middle.

Phonemes are discrete and categorical: we give names to them, and they allow us to discriminate words with different meanings. By processing acoustic input into phonemes, we ignore small variations that have no phonetic consequence, while paying close attention to the boundary regions, where small changes matter for phonetic identity. A robust representation of these sound categories in the brain allows us to rapidly process and understand incoming speech, and to compare our own speech productions to internal auditory schemata. The finding of increased brain activation for boundary stimuli implies that more neural resources are devoted to processing these ambiguous sounds.[4,5] The formation of phonetic categories is an example of perceptual warping of auditory space that is contingent upon acoustic exposure.

This project is novel in several arenas. Its data would be extremely valuable in localizing the neural sound maps that correspond to speech; in other words, we can more fully understand how speech sounds are represented in the auditory cortex. These experiments will also allow for improvements to the DIVA neural network model of speech acquisition,[12] which is used to posit neural computations that underlie speech processing and to model various speech pathologies. Finally, this work has important consequences for feedback-based training.

IV. RESEARCH DESIGN AND METHODS

In order to explore the role of phonetic categories in feedback control, we will examine the neural response to a sudden disruption in the auditory feedback loop, as elicited by an unexpected acoustic shift in real time. In this experiment, sudden auditory perturbations will occur during subjects' speech in an attempt to elicit a mismatch between the auditory speech target and its actual realization. We hypothesize that this will induce activity in the auditory error cells that detect this disconnect. Furthermore, we will attempt to distinguish between two populations of error cells: auditory error cells, which are activated by any auditory difference between intended and perceived sounds, and phonetic error cells, hypothesized to be active only when this auditory difference is large enough to cross a phonetic category boundary. That is, the phonetic error cells detect a mismatch between the target phoneme and the perceived phoneme. We will therefore construct opposing shifts in formant space whose magnitudes are identical within a subject, but of which only one crosses a phonetic category boundary; a sketch of this construction follows the Subjects section below.

Subjects. Subjects will consist of right-handed men and women, ages 18-35, whose first language is American English, and who have normal hearing and speech, as well as no metal in the body (required for imaging eligibility). Subjects whose vowel boundaries are determined to be asymmetric around a center vowel will be preferred, as this provides more assurance that a shift of fixed magnitude will cross a boundary in one direction but not the other.
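As a sketch of this construction (illustrative only: the boundary and formant values are hypothetical, the choice of magnitude is one simple possibility rather than the actual stimulus code, and only F1 is treated here although the real shifts live in F1-F2 space), the following chooses a single shift magnitude in mel space that crosses the nearer boundary in one direction while remaining within category in the other:

```python
import math

def hz_to_mel(f):  # mel conversions as in the earlier sketch
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Hypothetical values for one subject (Hz): median F1 of the spoken vowel,
# and the F1 boundary locations measured in the perception test below.
f1_center = 580.0         # median F1 for "bed"
f1_boundary_low = 480.0   # boundary toward "bid" (nearer)
f1_boundary_high = 720.0  # boundary toward "bad" (farther)

center_mel = hz_to_mel(f1_center)
d_low = center_mel - hz_to_mel(f1_boundary_low)    # mel distance to nearer boundary
d_high = hz_to_mel(f1_boundary_high) - center_mel  # mel distance to farther boundary

# A single magnitude between the two distances: shifting down crosses the
# "bid" boundary (shift-across), shifting up stays within "bed" (shift-within).
delta_mel = (d_low + d_high) / 2.0
assert d_low < delta_mel < d_high  # requires asymmetric boundaries

shift_across_f1 = mel_to_hz(center_mel - delta_mel)
shift_within_f1 = mel_to_hz(center_mel + delta_mel)
print(round(shift_across_f1), round(shift_within_f1))
```

This is also why asymmetric boundaries are preferred in subject selection: the farther the two boundaries differ in mel distance, the more room there is to pick a magnitude that crosses on one side only.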
Experimental design: overview. The experiment can be broken into three phases: an initial speech production test, a speech perception test, and an fMRI session.

Test 1: production. For each subject, vowel production data were collected in the carrier consonants [b_d], with ten productions for each of the six vowels [i, ɪ, ɛ, æ, ɑ, u]. Median formant values were determined for each vowel.

Test 2: perception. Using a real-time formant shifting algorithm (developed by Marc Boucek, Speech Communication Group), eight vowel continua were generated across the F1-F2 spectrum. The production token closest to the median for each vowel was shifted in formant space in ten successive increments towards its neighboring vowels. Each continuum began at the median formant values of one vowel and ended at the median formant values of a neighboring vowel, with one additional token added at each end. The step size between tokens in the continuum was constant on the mel scale (a perceptually derived logarithmic scale, where 1000 mels = 1000 Hz). Furthermore, two continua were generated for each vowel pair: one starting from each end.

The randomized tokens were presented four times each to the subjects, who were instructed to categorize what they heard as one of six possible words: bead, bid, bed, bad, bod, or booed. The categorization data were fitted to sigmoid curves to determine an approximate perceptual boundary between the vowels at the continuum endpoints, as illustrated in the sketch below.
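The boundary-fitting step might look like the following sketch, which fits a logistic function to one continuum's identification data and takes the 50% crossover point as the boundary estimate. The response proportions are invented for illustration, and the actual analysis code may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, x0, k):
    """Sigmoid identification function: P(respond "bid") at each step."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

# Hypothetical data for one bed-to-bid continuum: token index along the
# continuum, and proportion of "bid" responses (4 presentations per token).
steps = np.arange(13)
p_bid = np.array([0.0, 0.0, 0.0, 0.0, 0.25, 0.25, 0.5,
                  0.75, 1.0, 1.0, 1.0, 1.0, 1.0])

(x0, k), _ = curve_fit(logistic, steps, p_bid, p0=[6.0, 1.0])
print(f"boundary at step {x0:.2f}, slope {k:.2f}")
# The steepness k indexes how categorical the percept is; x0 is the step
# (convertible back to mel or Hz) taken as the perceptual boundary.
```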
Test 3: brain imaging. Using fMRI, I will measure the BOLD response during speech, both with and without perturbation, as well as during a baseline condition. The experiment will be event-related, using a triggering mechanism to coordinate stimulus timing with image acquisitions. Subjects were scanned in a 3T Siemens Tim Trio whole-body MRI machine, located at the Athinoula A. Martinos Imaging Center at the McGovern Institute for Brain Research, MIT. The subjects' speech was recorded by a custom-made MR-safe microphone, and auditory feedback was delivered via insert headphones (Stax SRS-005II electrostatic headphones). Subjects wore supra-aural ear seals surrounded by a custom-made foam helmet, affectionately nicknamed the Head Cozy, to insulate them from the noise of the scanner.

On each trial, subjects were presented with a word (e.g. "bed") or a control stimulus ("***"). These visual stimuli were projected in high-contrast white-on-black and displayed on a rear projection screen, visible to the subjects through a mirror mounted above the MRI head coil. Subjects were instructed to read the word aloud and to remain silent on the control trials. Unbeknownst to the subjects, the speech trials were divided into three conditions: on one out of every four test trials, the formants were perturbed either across or within the vowel category of the vocalized word before being fed back to the subjects. This elicits the percept of having mispronounced the trial's word; the auditory output the subjects expect to hear does not correspond with the artificially shifted output of the headphones. The participants in this study were mostly unaware of any auditory mischief between microphone and headphones. In summary, the four conditions experienced by the subject were:

1. baseline: a control condition in which the subject remained silent.
2. no-shift: feedback was returned to the subject unaltered.
3. shift-within: a within-category shift was applied to the subjects' speech.
4. shift-across: a cross-category shift of the same magnitude as that of condition 3.

After a 2-second delay from stimulus offset, the stimulus presentation software triggered the scanner to collect two volumes of functional data. This trigger is followed by a pause of [N] seconds before the next trial to allow for the partial return of the BOLD signal to the steady state. Because the image acquisition is timed to occur several seconds after the stimulus onset, subjects speak in relative silence. The acquisition parameters were typical of those used in previous speech experiments (echo planar imaging, 30 slices covering the entire cortex and cerebellum aligned to the AC-PC line, 5 mm slice thickness, 0 mm gap between slices, flip angle = 90°).

[FIGURE: Timeline for each trial.]

In addition to the functional data, anatomical volumes were collected in order to overlay each subject's functional data on his or her own brain. Diffusion tensor imaging was also performed to track white matter tracts as they travel between connected regions of the cortex. These data will be used for functional connectivity analyses between brain regions implicated in the task.

Data analysis. Median formant values of the acoustic data will be compared across the no-shift, shift-within, and shift-across conditions; a sketch of this comparison follows below. The functional data will be analyzed using software packages including SPM and FreeSurfer. Both voxel-based and surface-based analyses of activation will be carried out. The pre-training activation patterns (passive listening minus resting state) will be compared with the post-training activation patterns (passive listening minus resting state) in terms of location and extent of activation. The behavioral data will also be compared before and after training to look for differences in subjects' ability to categorize and discriminate stimuli drawn from the non-speech continuum.
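As a minimal illustration of the acoustic comparison (all values are hypothetical; the real analysis operates on formant tracks extracted from the recordings), median produced F1 per condition and a simple compensation index might be computed as follows:

```python
import numpy as np

# Hypothetical per-trial median F1 values (Hz), grouped by condition.
f1 = {
    "no-shift":     np.array([582.0, 575.0, 590.0, 585.0]),
    "shift-within": np.array([560.0, 555.0, 570.0, 565.0]),
    "shift-across": np.array([550.0, 545.0, 552.0, 548.0]),
}
applied_shift_hz = +40.0  # hypothetical upward F1 perturbation on shifted trials

baseline = np.median(f1["no-shift"])
for cond in ("shift-within", "shift-across"):
    produced_change = np.median(f1[cond]) - baseline
    # Compensation is articulatory movement opposing the perturbation,
    # expressed as a fraction of the applied shift.
    compensation = -produced_change / applied_shift_hz
    print(f"{cond}: median F1 change {produced_change:+.1f} Hz, "
          f"compensation {compensation:.0%}")
```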
Expected results. Psychophysical results show that in the majority of cases, subjects exhibited categorical perception, as evidenced by the steep slope of the sigmoid fit to the data (Figure #). Furthermore, the perceptual boundary within a given subject differed based on the starting point of the continua (that is, the vowel token from which the continua were generated). Even though the tokens were presented randomly, the percept of the original vowel tended to dominate each continuum. Secondly, there is a great deal of variability in the location of vowel boundaries across subjects for a given continuum (Figure #). That is, it takes a larger step size to elicit the perception of phonetic change for some subjects than for others.

Interpretation. My hypothesis is that the learning of sound categories changes the distribution of the firing preferences of neurons in auditory cortical maps, and thus changes the discriminability of sounds from different parts of acoustic space. I will interpret a change in cortical activation, in the presence of no change for control stimuli, as evidence that cortical reorganization has occurred in response to the acoustic training. In particular, an increase in activation implies more neural resources devoted to processing a sound near the category boundary, while a decrease in activation implies a concomitant lessening of resources. This reallocation of neural resources for prototypical stimuli would be consistent with past imaging work.[4] Broadly, success in this protocol would be interpreted as new evidence for the plasticity of adult phonetic representations in cortex, and a promising start to routines that might exploit this ability in second-language learning.

Alternative approaches. One potential limitation of this paradigm is the inability to directly compare identical shifts: within a subject, one shift is always in the direction opposite the other. We have controlled for this by counterbalancing subjects for whom the category-crossing shift is in the upward direction and subjects for whom it is in the downward direction. However, an alternative experimental design contrasts shifts of the same magnitude and direction across subjects: it is possible to create chains of subjects in which the same shift serves as a within-category shift for one subject and a cross-category shift for the next. This design was abandoned when it proved too difficult to construct such chains and too risky to rely on single subjects' data to keep a chain unbroken. (During pilot testing, one subject dropped out of the study mid-scan after his "pair" had already been run, rendering both data sets useless.)

Potential limitations. The success of these experiments depends on the accuracy of the psychophysical test that determines category boundaries. Additionally, categories may be unstable across time. Refining the nature of these stimuli will take up a large portion of my research time. In addition, these experiments bank on successful cortical reorganization to show a compelling result. The null result, in which brain activation cannot be distinguished between the within- and across-category conditions, would still be an interesting finding, however; this work would help to characterize the cortical response to different types of feedback perturbation and would provide the DIVA model with potential parameters.

V. REFERENCES

1. Werker and Tees, 1983. Developmental changes across childhood in the perception of non-native speech sounds. Can J Psychol, 37(2), 278-286.
2. Tees and Werker, 1984. Perceptual flexibility: maintenance or recovery of the ability to discriminate non-native speech. Can J Psychol, 38(4), 579-590.
3. Kuhl et al., 1992. Linguistic experience alters phonetic perception in infants by 6 months of age. Science, 255, 606-608.
4. Guenther et al., 2004. Representation of sound categories in auditory cortical maps. Journal of Speech, Language, and Hearing Research, 47(1), 46-57.
5. Guenther and Gjaja, 1996. The perceptual magnet effect as an emergent property of neural map formation. J Acoust Soc Am, 100, 1111-1121.
6. Harnad, 1986. Psychophysical and cognitive aspects of categorical perception: A critical overview. In S. Harnad (Ed.), Categorical Perception: The Groundwork of Cognition. New York: Cambridge University Press, 1-52.
7. Blumstein et al., 2005. The perception of voice-onset time: An fMRI investigation of phonetic category structure. J Cog Neurosci, 17(9), 1353-1366.
8. Boucher, 2002. Timing relations in speech and the identification of voice-onset times: A stable perceptual boundary for voicing categories across speaking rates. Percept Psychophys, 64, 121-130.
9. Miller et al., 1976. Discrimination and labeling of noise-buzz sequences with varying noise-lead times: An example of categorical perception. J Acoust Soc Am, 60(2), 410-417.
10. Pisoni, 1977. Identification and discrimination of the relative onset of two component tones: Implications for voicing perception in stops.
J Acoust Soc Am, 61(5), 1352-1361.
11. Eimas et al., 1971. On infant speech perception and the acquisition of language. In S. Harnad (Ed.), Categorical Perception: The Groundwork of Cognition. New York: Cambridge University Press.
12. Guenther et al., 2006. Neural modeling and imaging of the cortical interactions underlying syllable production. Brain and Language, 96(3), 280-301.