Project Description

A. Objectives and Significance

The broad goal of the proposed project is to investigate how speaker variability affects speech perception. To that end, this project will focus on the perception of lexical tones. In tone languages, lexical tones are functionally equivalent to consonants and vowels. The primary acoustic correlate of lexical tone is fundamental frequency (F0). Because F0 range varies across speakers, a phonologically high tone produced by one speaker could be acoustically equivalent to a phonologically low tone produced by another speaker. Conversely, a given tone produced by two speakers could be acoustically distinct. How do listeners process such acoustic variability across speakers in speech perception? Research on speaker variability has traditionally focused on vowel perception in the English language. How speaker variability affects the perception of suprasegmental features in non-English languages is largely unknown. Given that tone languages constitute the majority of the world’s languages (Laver, 1994) and that lexical tones employ acoustic cues that are distinct from those for segmental phonemes, the expected significance of the proposed project is to extend current knowledge about how listeners process speaker variability to suprasegmental features of speech. With a crosslinguistic approach and established research paradigms, the proposed project is expected to contribute significantly to our understanding of this foundational issue in speech perception.

Objective 1 of the proposed project is to examine the acoustic and perceptual basis of listeners’ ability to estimate F0 height without cues typically present for speaker normalization. Previous research showed that listeners use dynamic F0 contour, external context, and familiarity with speakers to accomplish multispeaker tone perception (Leather, 1983; Moore & Jongman, 1997; Wong & Diehl, 2003).
However, recent evidence indicates that relative F0 height in multispeaker speech can be identified without these cues (Bishop & Keating, 2010; Honorof & Whalen, 2005; Lee, 2009; Lee, Lee, & Shr, 2011). Acoustic data from Lee’s (2009) Mandarin study suggest that listeners may take advantage of covariations between F0 and voice quality measures to judge the relative F0 height of tones across speakers. Perceptual data further suggest that detection of speaker gender mediates multispeaker Mandarin tone identification (Lee, Dutton, & Ram, 2010). However, studies using non-tone language (English) materials did not reveal a strong contribution of voice quality (Bishop & Keating, 2010) or gender identification (Honorof & Whalen, 2010), raising the possibility that tone and non-tone language listeners use distinct strategies in F0 height estimation. In addition, because none of the Mandarin tonal contrasts involve F0 height alone, it is not known whether the F0 estimation ability also generalizes to tone languages with intrinsic level-tone contrasts. The answers to these questions have important implications for speech perception theories. First, if listeners of a level-tone language can estimate F0 height reliably without speaker normalization cues, the finding will challenge the long-held assumption that those cues are necessary to process speaker variability in tone perception. Consequently, identifying the acoustic and perceptual basis for the F0 estimation ability will significantly advance our knowledge about processing speaker variability beyond segmental phonemes. Second, investigating F0 height estimation by listeners with and without tone language experience will elucidate how language experience shapes the ability to process speaker variability. In Study 1, Cantonese level tones that contrast only in F0 height will be recorded by multiple speakers. 
The multispeaker tones will be presented in isolation to listeners without prior exposure to the speakers. To evaluate the effect of tone language experience, four groups of listeners will be used: Cantonese, Taiwanese (a tone language with a level-tone contrast), Mandarin (a tone language without level-tone contrasts), and English (a non-tone language). Four hypotheses will be tested: (1) Accuracy of F0 height judgment will be obtained to test the hypothesis that listeners are able to identify multispeaker level tones without cues typically present for speaker normalization. (2) Acoustic analysis will be conducted on the tone stimuli to test the hypothesis that covariations exist between F0 and voice quality measures as the basis for the F0 estimation ability. (3) Accuracy of speaker gender judgment will be obtained to test the hypothesis that gender detection contributes to tone identification. (4) Identification performance will be compared among the listener groups to test the hypothesis that F0 estimation ability varies as a function of tone language experience.

Objective 2 of the project is to investigate the impact of speaker variability on accessing the form and meaning of spoken words. Although speech perception research has traditionally focused on the identification and discrimination of speech sounds, there has been increasing interest in the study of lexical processing, i.e., the mapping of sound onto the mental lexicon. Understanding the nature of lexical processing is important because the ultimate goal of speech perception is to access words in the mental lexicon. Traditionally, lexical representations are assumed to be abstract phonological codes storing only lexically contrastive information. Under this assumption, the process of recognizing spoken words entails discarding surface acoustic variability (e.g., speaker-specific information) in order to arrive at abstract phonological codes.
However, there is emerging evidence that surface acoustic variability is encoded in lexical representations, suggesting that mapping from acoustic signals onto the lexicon does not necessarily involve reduction of acoustic variability (Pisoni, 1997). Although this conclusion is supported by many psycholinguistic studies on lexical representations, there is little evidence on the effect of speaker variability on lexical processing. In addition, it is not known whether the conclusions drawn from non-tone language studies would generalize to tone languages. Because F0 is the primary acoustic cue for identifying both lexical tones and speaker characteristics, the effect of speaker variability on lexical processing may depend on the role of F0 in a language. The answers to these questions will inform models of spoken word recognition and elucidate the role of suprasegmental features in lexical processing. Study 2 will address these questions by examining the effects of speaker variability on repetition priming (Forster & Davis, 1984) and semantic/associative priming (Meyer & Schvaneveldt, 1976). These two types of priming are well-established cognitive phenomena that have been shown to effectively reveal the process of accessing the form and meaning of words. In this study, prime-target pairs that vary systematically in form/meaning relationship and speaker relationship will be presented, and the accuracy and reaction time of listeners’ responses to the targets will be analyzed to test the following hypotheses: (1) The magnitude of repetition and semantic/associative priming will be reduced when the prime and target are produced by different speakers. (2) The reduction of repetition and semantic/associative priming will vary as a function of the time lag between the prime and target, revealing the time course of the speaker variability effect. 
(3) The reduction of priming by speaker variability will be greater in a tone language because of the extensive use of F0 for lexical distinctions, resulting in elevated sensitivity to speaker characteristics in tone language listeners. These results will be compared to the PI’s pilot work using English materials (Lee & Zhang, submitted). With the crosslinguistic approach and a well-established research paradigm, this study is expected to significantly advance current knowledge of the effect of speaker variability on lexical processing.

In sum, the significance of the proposed project is that it addresses a foundational issue in speech perception (how listeners process speaker variability in acoustic signals) by employing a crosslinguistic approach and well-established paradigms in speech perception, speech acoustics, and lexical processing research. With a focus on suprasegmental features of speech, this project is expected to significantly extend current knowledge of speaker variability processing beyond the segmental structure of speech. The two proposed studies are firmly grounded in the PI’s published research and pilot work. Both studies are also thematically related to the long-term goals of the PI’s research program in understanding the processing of linguistic and nonlinguistic prosody by listeners with various characteristics.

B. Background and Preliminary Studies

Spoken language comprehension involves mapping acoustic signals onto linguistic representations. Two fundamental questions in the study of spoken language comprehension are the nature of sound/lexical representations and how listeners retrieve the representations from acoustic signals. Acoustic-phonetic research has shown that phonologically identical utterances can vary significantly across speakers. Despite the acoustic variability, listeners are able to understand sounds and words spoken by different speakers.
How listeners achieve the perceptual constancy in the face of speaker variability is a foundational issue in speech perception research (Johnson, 2005). Emerging evidence in speech perception research has challenged traditional assumptions about the abstractness of lexical representations (Pisoni, 1997). Although an abundance of research has been devoted to understanding speaker variability in processing segmental phonemes, research on processing speaker variability for suprasegmental features is relatively scarce. This is most likely due to the fact that suprasegmental features do not play a prominent role in phonemic distinctions in English (Cutler, 1997). However, suprasegmental features are an integral part of speech and are closely associated with speaker characteristics. Therefore, processing speaker variability for suprasegmental features is an essential part of speech perception, particularly in languages with extensive use of lexically contrastive prosody.

B.1. Perception of relative F0 height from multispeaker tones

F0 range varies across speakers. A phonologically high tone produced by a male speaker could be acoustically equivalent to a phonologically low tone produced by a female speaker. On the other hand, a given tone produced by two speakers could be acoustically distinct. Intuitively, judgment of the relative F0 height of a tone intended by a speaker has to be made with reference to the speaker’s F0 range. This observation is supported by research showing that tone perception is contingent on the perceived F0 range of a speaker. Leather (1983) examined identification of Mandarin tones that were synthesized to be lexically ambiguous. The tone stimuli were presented in carrier phrases produced by two speakers. The results showed that stimuli with identical absolute F0 contours were identified as different tones depending on which speaker was heard, indicating the use of perceived range information in tone perception.
This finding was replicated by Moore and Jongman (1997), who showed that Mandarin tone stimuli with identical F0 patterns were perceived as high tones in a low F0 carrier phrase produced by one speaker, but as low tones in a high F0 carrier phrase produced by another speaker. Wong and Diehl (2003) examined identification of Cantonese level tones embedded in carrier phrases produced by seven speakers. The results showed that the same target tones were identified differently depending on which carrier phrase was used. It is clear from these studies that context provides important information about speaker F0 range. Listeners can use contextual information to interpret tones just as they do to interpret vowels (e.g., Ladefoged & Broadbent, 1957). Therefore, removing context should make it difficult to estimate speaker F0 range. The absence of context should particularly compromise identification of level tones, whose contrasts rely solely on relative F0 height. Wong and Diehl (2003) examined identification of three Cantonese level tones that were produced by seven speakers and presented in isolation. The results showed that tone identification was more accurate when the stimuli were blocked by speaker (80%) than when they were mixed across speakers (49%). As expected, identification performance was compromised by the mixed-speaker stimuli (Creelman, 1957; Zhou, Zhang, Lee, & Xu, 2008). However, the fact that identification accuracy still exceeded chance (33%) in both conditions indicates that the absence of context does not make F0 height judgment impossible. It also indicates that there are syllable-internal cues to relative F0 height. However, because listeners in the experiment heard the stimulus set 12 times, they could have learned to estimate the F0 range of the speakers through repeated exposure. To rule out the familiarity account, the PI (Lee, 2009) recorded Mandarin sa syllables with four tones produced by 16 female and 16 male speakers.
The syllables were digitally processed such that only the fricative and first six glottal periods of the vowel remained, effectively neutralizing F0 contour contrasts among the tones. These multispeaker, level-F0 stimuli (i.e., no F0 contour cues) were presented in isolation (i.e., no contextual cues) with each stimulus being presented just once (i.e., no familiarity cues). Despite the absence of those cues typically considered necessary for speaker normalization, listeners were able to identify the intended tones with above-chance accuracy. This finding was replicated by Lee and Lee (2010). Lee’s (2009) acoustic analyses further revealed contrasts between the high- and low-onset tones in F0, duration, and two voice quality measures (F1 bandwidth and spectral tilt). Correlation analyses also showed that F0 covaried with the voice quality measures and that tone classification based on F0 height correlated with the voice quality measures. Because the same acoustic measures consistently distinguished female from male stimuli, Lee (2009) proposed that speaker gender detection may be the basis for the F0 height judgment performance. The PI and colleagues (Lee et al., 2010) evaluated this proposal by asking listeners to judge speaker gender from the same set of stimuli used in Lee (2009). The results showed that gender identification accuracy was above chance, suggesting that the ability to judge F0 height from these stimuli is likely due to successful identification of speaker gender as a precursor. Specifically, listeners identify speaker gender based on voice quality and then exploit the covariation between F0 and voice quality for relative F0 height estimation. Once the gender decision is made, pitch class templates stored in memory that are gender-specific can be invoked to compare to the stimulus. 
Listeners may calibrate their judgments according to the templates, which reflect typical F0s for female and male speakers that listeners have experienced throughout their lives. It has been noted that pitch class templates can be acquired from exposure to prevalent speaking F0s of a linguistic community (Dolson, 1994). If so, F0 height of a tone stimulus could be inferred with the templates as a reference frame. The ability to estimate relative F0 height without speaker normalization cues has also been reported for non-tone language listeners. Honorof and Whalen (2005) showed that English listeners were able to locate an F0 reliably within a speaker’s F0 range without context or prior exposure to a speaker’s voice. Isolated vowel tokens, produced by 20 English speakers with varying F0s, were presented to listeners to judge where each token was located in the speakers’ F0 ranges. The results showed significant correlations between the perceived F0 location and the actual location in the speakers’ F0 ranges, indicating that the listeners were able to estimate relative F0 height. It was speculated that covariation between F0 and voice quality might have contributed to the identification performance, although this hypothesis was not directly tested in the study. The potential role of voice quality in the F0 estimation ability was evaluated by Bishop and Keating (2010). Their first experiment replicated Honorof and Whalen’s (2005) finding that listeners’ perceived F0 locations correlated with speakers’ actual F0 locations, confirming that the listeners were able to estimate relative F0 height. Statistical modeling showed that F0 is the single most important predictor for the F0 estimation performance. By contrast, acoustic measures of voice quality contributed only minimally to the F0 estimation ability. The second experiment showed that listeners were able to identify speaker gender from the same set of stimuli.
Statistical modeling again showed that F0 is the most important predictor for the gender identification performance. In contrast to the first experiment, voice quality contributed to the gender identification performance to a greater extent than it did to F0 height estimation. It was concluded that listeners form expectations about F0s for average male and female speakers through experience, and that they rely on absolute F0 to determine speaker gender, which in turn contributes to relative F0 estimation. That is, voice quality contributes to F0 estimation only indirectly through gender identification. The contribution of gender identification to relative F0 estimation was questioned by Honorof and Whalen (2010). Listeners were asked to judge speaker gender from isolated vowels spoken by 20 English male and female speakers with overlapping F0s. The listeners performed above chance overall, but showed a bias toward hearing high F0s as female and low F0s as male when stimulus F0s were near range extremes. There was no strong evidence for a contribution of voice quality, weakening the argument that voice quality is used to identify speaker gender. The authors propose that the gender identification results are best explained by the listeners’ primary reliance on absolute F0 and secondary reliance on formants or vocal tract information. Findings from these English-based studies provided important information about processing speaker variability in nonlinguistic F0 distinctions. However, it is not clear whether these conclusions could generalize to processing speaker variability in tone languages, in which F0 distinctions are lexically contrastive. Although the PI’s work on Mandarin tones (Lee, 2009; Lee et al., 2010) provided preliminary evidence for the role of voice quality and gender identification in relative F0 estimation, the evidence is weakened by the fact that Mandarin tones do not involve contrasts that rely solely on F0 height. 
Three of the four Mandarin tones are contour tones. Therefore, listeners can use the contours to infer F0 range and do not need to rely on F0 height to identify tonal contrasts. The most stringent test of the ability to estimate relative F0 height without speaker normalization cues will be to use a tone language that has intrinsic level-tone contrasts. The PI’s recent study on Taiwanese (Lee et al., 2011) provides preliminary evidence. Taiwanese is a tone language with seven lexical tones, two of which are level tones (high-level and mid-level) contrasting only in F0 height. In the study, the two level tones, produced by 30 male and female speakers, were presented in isolation. Musically-trained listeners were able to identify the tones with above-chance accuracy, indicating that they were able to estimate relative F0 height without cues typically considered necessary for speaker normalization. What remains unclear is whether listeners are able to estimate F0 height in languages with multiple level-tone contrasts. It is also not known how tone language experience would shape the ability. Study 1 of the proposed project will use Cantonese to further investigate the acoustic and perceptual basis of listeners’ ability to estimate relative F0 height from multispeaker tones. This study is a meaningful and substantial extension of the aforementioned literature in two ways. First, Cantonese has a rich inventory of lexical tones, three of which are level tones contrasting in F0 height only (high-, mid-, and low-level). The multiple tonal contrasts provide ideal testing materials for the proposed research question. Second, the effect of tone language experience on the ability to estimate F0 height will be systematically investigated by using listeners of Cantonese, Taiwanese (a tone language with a level-tone contrast), Mandarin (a tone language without level-tone contrasts), and English (a non-tone language).
Findings from the proposed study are expected to significantly advance our knowledge about how speaker variability exhibited in F0 height is processed in speech perception.

B.2. Processing speaker variability in accessing word form and meaning

The ultimate goal of speech perception is to map acoustic signals onto linguistic representations. Because words/morphemes are the smallest meaningful units of speech, a fundamental question in speech perception is how listeners extract information from the acoustic signals to access words. It is typically assumed that lexical representations are abstract and contain only lexically contrastive information, implying that surface acoustic variability not directly relevant to lexical identity is discarded during the signal-to-representation mapping process. By this account, speaker-specific information should not be part of the abstract phonological code stored in lexical representations. Consequently, memory for words should not be influenced by variability across speakers. This prediction is supported by Jackson and Morton (1984), who used a long-term priming paradigm to examine word recognition in a “test” phase following a “study” phase. In the study phase, participants listened to words spoken by a female or a male speaker and made semantic judgments about the words. In the test phase, participants listened to the same set of words that were produced by either the same speaker or different speakers. The results showed no differences in word recognition accuracy between words spoken by the same speaker and different speakers. Schacter and Church (1992) asked listeners in the study phase to judge the pitch of the speakers from words spoken by six speakers. In the test phase, listeners were asked to identify words that were spoken by the same or different speakers.
Although the study phase was intended to focus the listeners’ attention on speaker-specific information, word recognition accuracy was similar under both the same- and different-speaker conditions, suggesting that speaker variability does not affect word recognition. By contrast, other studies employing the long-term priming paradigm showed that memory for words contains detailed information about speakers and that this information affects word recognition. A speaker effect emerged when Schacter and Church (1992) changed their task in the test phase (from word identification to stem completion) and presented the test stimuli in quiet (instead of in noise). Listeners were asked to propose a multisyllabic word based on auditory input of the first syllable of a word. This “stem completion” was more accurate when the stimuli in the test phase were produced by the same speaker as in the study phase. Church and Schacter (1994) asked listeners to identify words that were low-pass filtered to preserve F0 information, but not formant information. The results showed that word identification was more accurate when the stimuli were produced by the same speaker as in the study phase. In addition, identification accuracy was also affected by within-speaker F0 variation. Specifically, when the same speaker was used in the study and test phases, word identification was more accurate when the F0 between study and test was the same, suggesting that detailed F0 information is encoded in memory for words. Subsequent studies identified additional factors that contribute to the speaker variability effect in long-term priming. Goldinger (1996) showed that word recognition in noise was more accurate when both study and test words were produced by the same speaker. Importantly, the same-speaker advantage in word recognition was observed only when words were encoded at relatively shallow levels of processing (i.e., gender and phoneme classification).
By contrast, the speaker effect was attenuated when listeners were asked to encode words at a relatively deep level of processing (i.e., syntactic classification). There is also evidence that the speaker effect is contingent on response format. Luce and Lyons (1998) examined the effect of changing speaker on auditory lexical decision (probing implicit memory) and on old/new judgment (probing explicit memory). In the study phase, listeners made lexical decisions on word and nonword stimuli. In the test phase, listeners either performed lexical decisions again or judged whether each stimulus had been presented in the study phase. The results showed that responses were faster to words that were produced by the same speaker in old/new judgment, but not in lexical decision, indicating that speaker variability affected explicit memory, but not implicit memory. Luce and Lyons (1998) proposed that the speaker effect did not emerge in lexical decision because lexical decision requires rapid responses, whereas the speaker effect takes more time to develop because speaker-specific details in lexical representations are not available as early as the more abstract underlying forms. In other words, when processing is fast, the speaker effect is attenuated. As processing unfolds over time, the speaker effect becomes more pronounced. This time-course hypothesis was further tested in McLennan and Luce (2005). In three long-term priming experiments, the authors manipulated ease of word/nonword discrimination (easy vs. difficult) and response format (speeded vs. delayed response). These manipulations required listeners to process words at different speeds (i.e., difficult distinctions and the delayed response format would result in slower processing). As predicted, the speaker effect was found only when discrimination was difficult and in the delayed response format, suggesting that processing speaker-specific information takes time.
Mattys and Liss (2008) showed that the speaker effect on long-term priming was more pronounced in stimuli produced by dysarthric speakers than those produced by normal control speakers. Because processing dysarthric speech is more difficult and presumably requires more time than processing non-dysarthric speech, this result is consistent with the idea that processing speaker-specific information requires time. Vitevitch and Donoso (2011) showed in a lexical decision task that it was easier for listeners to detect a speaker change when nonword stimuli were word-like than when the stimuli were less word-like. Because the word-like stimuli were more difficult to process, they required more processing time than less word-like stimuli. Therefore, the better detection of speaker change in the word-like stimuli suggests that speaker-specific information requires more time to process. Although these long-term priming studies provided evidence for the richness of lexical representations, it is not clear to what extent such a task can reveal the word recognition process itself (Luce & Lyons, 1998). In particular, the process of spoken word recognition is usually characterized by the activation of multiple word candidates and competition among the candidates (Luce & McLennan, 2005). In addition, lexical activation and competition is a time-sensitive, “on-line” process that has been extensively examined with various research paradigms (Grosjean & Frauenfelder, 1996). Few studies, however, have examined the role of speaker variability in these short-term lexical processes. An eye-tracking study by Creel, Aslin, and Tanenhaus (2008) demonstrated the on-line use of speaker information in lexical disambiguation. The stimuli included pairs of words spoken by the same speaker or different speakers.
Eye-tracking results showed fewer fixations on competitors for words from the different-speaker pairs than for words from the same-speaker pairs, indicating that speaker information facilitated disambiguation of lexical competitors. A similar response pattern was found in a second experiment, in which listeners learned to identify visual shapes from novel labels spoken by the same speaker or different speakers. There were more fixations on words with same-speaker cohorts, but fewer fixations on words with same-speaker competitors. These results indicate that listeners are able to use speaker-specific information encoded in lexical representations during lexical activation and competition. Short-term paired priming is a paradigm that has been used extensively to study lexical processing (see Zwitserlood, 1996, for a summary), but has not been applied to the study of speaker variability. In this paradigm, a prime and a target that are related in certain ways (e.g., phonologically) are presented. The accuracy and latency of responses to the target are used to evaluate the effects of the prime-target relationship, thereby revealing the organization of the lexicon and the processes of accessing lexical representations. Of particular relevance to the proposed study, short-term priming has been used to examine the effects of acoustic variability on spoken word recognition and the time course of the effects. Andruski, Blumstein, and Burton (1994) used short-term paired priming to investigate the effect of subphonetic variability on lexical access. Prime-target pairs that varied in semantic/associative relationship (e.g., king-queen) were used as stimuli, and participants were asked to make lexical decisions on the targets.
The results showed that subtle voice onset time (VOT) differences, which did not change the perception of voicing categories, affected the magnitude of semantic/associative priming, suggesting that detailed acoustic information was not discarded during the recognition process. Rather, subphonetic information affected access to word meaning. In addition, the effect of subphonetic variability appeared at a relatively short lag (50 ms), but not at a long lag (250 ms) between prime and target, indicating that the impact of subphonetic variability dissipated quickly. Andruski et al.’s (1994) findings are significant because they demonstrated that acoustic variability that does not alter word identity could still affect lexical processing. The study also provided evidence for the time course of the subphonetic variability effect. Given the multiple sources of acoustic variability in the speech signal, one might ask whether other sources of acoustic variability, such as speaker variability, would affect lexical processing in a similar way. Using the short-term paired priming paradigm to investigate speaker variability, the following questions could be answered: Does speaker variability affect access to word meaning as subphonetic variability does? Does speaker variability affect access to word form in addition to word meaning? Is the magnitude and time course of the speaker variability effect comparable to those of the subphonetic variability effect? The answers to these questions will further contribute to our understanding of the role of acoustic variability in speech perception. The PI’s pilot work using English materials (Lee & Zhang, submitted) provided preliminary answers to these questions. The effect of speaker variability on accessing the form and meaning of spoken words was evaluated in two short-term paired priming experiments. 
In the repetition priming experiment, participants listened to repeated or unrelated prime-target pairs, in which the prime and target were produced by the same speaker or different speakers. The results showed that the magnitude of repetition priming was reduced when the prime and target were produced by different speakers, indicating that speaker variability affected access to word form. In the semantic/associative priming experiment, participants listened to semantically/associatively related or unrelated prime-target pairs, in which the prime and target were produced by the same speaker or different speakers. The results showed that the magnitude of semantic/associative priming was reduced in different-speaker trials, but only for targets produced by one of the speakers. There was no evidence that the speaker variability effect varied as a function of the interstimulus interval between the prime and target (50 ms and 250 ms). These findings suggest that speaker variability affects spoken word recognition, but primarily at a relatively shallow level of processing.
Findings from this pilot study are encouraging. First, the study showed that the short-term paired priming paradigm, which has been used extensively in lexical processing research, is also effective in revealing the speaker variability effect. Second, the study showed distinct response patterns between repetition and semantic/associative priming, indicating that speaker variability affects access to word form and meaning differently. Third, there was some indication that the time course of the speaker variability effect differed between repetition and semantic/associative priming, suggesting that there may be distinct time courses between accessing word form and meaning. These positive findings are worthy of further exploration. However, only two speakers (one female and one male) were used in the pilot study and gender was a confounding variable.
It is not clear whether the modest number of speakers used in the stimuli was effective in generating a sufficient range of speaker variability for the listeners. Moreover, it is not known whether the conclusions drawn from non-tone language studies would generalize to tone languages. Because F0 is the primary acoustic cue for identifying lexical tones and speaker characteristics, an intriguing question is whether the effect of speaker variability on lexical processing will depend on the role of F0 in a language. These issues will be investigated in Study 2 to systematically examine the effects of speaker variability, depth of processing, time course, and tone language experience on listeners’ ability to access the form and meaning of spoken words.
In sum, this literature review indicates that human listeners are capable of dealing with speaker variability in speech perception, but the acoustic and perceptual basis of that ability needs further exploration. Understanding how speaker variability is processed is one of the foundational issues in speech perception research. Although speaker variability research has traditionally focused on processing segmental structure in the English language, recent work in the PI’s lab has provided preliminary evidence on how listeners process speaker variability in tone languages. By thematically examining the acoustic basis of F0 height perception and the effect of speaker variability on lexical processing, the proposed project will be a substantial and meaningful extension of the speech perception literature on processing speaker variability.
C. Experimental Design and Methods
Two integrated studies are proposed to address the two objectives of this project. These two studies are distinct in the specific questions addressed, but are thematically related to the overarching issue of processing speaker variability in speech perception.
Both studies will employ well-established research paradigms in speech perception, speech acoustics, and spoken word recognition. All of the research paradigms have been successfully applied to the PI’s published studies and pilot work.
C.0. General procedures
Speech recordings for experimental stimuli will be made in a sound-treated booth with a condenser microphone connected through a preamplifier and analog-to-digital converter to a computer. The recordings will be digitized with the Brown Lab Interactive Speech System (BLISS; Mertus, 2000) at 20 kHz with 14-bit quantization. Each stimulus item will be identified from the waveform display and saved as an audio file. The peak amplitude will be normalized across items. BLISS will be used for stimulus delivery. All participants will be adults between 18 and 35 years old, and will be screened for normal hearing, defined as pure-tone, air-conducted thresholds of 20 dB HL or better at octave frequencies from 250 to 8000 Hz. Participants will be tested individually in a sound-treated booth, listening to stimulus items through high-quality headphones. BLISS will be used for response data acquisition.
C.1. Study 1: Perception of relative F0 height from multispeaker tones
Study 1 will examine the acoustic and perceptual basis of F0 height estimation without typical cues for speaker normalization. Experiment 1.1 will examine Cantonese level tone identification. Experiment 1.2 will examine speaker gender identification from the same set of tone stimuli. Four hypotheses will be tested: (1) Accuracy of F0 height judgment will be obtained to test the hypothesis that listeners are able to identify multispeaker level tones without cues typically considered necessary for speaker normalization. (2) Acoustic analysis will be conducted on the stimuli to test the hypothesis that covariations exist between F0 and voice quality measures as the basis for the F0 estimation ability.
(3) Accuracy of speaker gender judgment will be obtained to test the hypothesis that gender detection contributes to tone identification. (4) Identification performance will be compared among the listener groups to test the hypothesis that F0 estimation ability varies as a function of tone language experience.
C.1.a. Materials. In Experiment 1.1, two triplets of monosyllabic Cantonese words contrasting in level tones (high-mid-low) will be selected as stimuli, e.g., /fu˦/ (夫 “husband”), /fu˧/ (富 “rich”), and /fu˨/ (負 “negative”); /ʃɪŋ˦/ (升 “rise”), /ʃɪŋ˧/ (勝 “win”), and /ʃɪŋ˨/ (盛 “prosperous”). By consulting relevant sources of word frequency counts (e.g., Kwan, 2001), care will be taken to select tonal minimal pairs that are relatively balanced in word frequency across the three tones to avoid lexical effects on tone identification. The selected words will be recorded by 15 female and 15 male adult native speakers of Cantonese. Two stimulus lists will be generated, with the 90 items from one triplet (30 speakers × 3 tones) in the first list, and the 90 items from the other triplet in the second list. Each listener will be randomly assigned to receive only one list to minimize familiarity with individual speakers. To further minimize familiarity, the 90 stimuli produced by the 30 speakers will be assigned to three blocks such that each block includes only one stimulus from a given speaker. Within each block, the number of female and male speakers will be balanced (15 females and 15 males), as will be the number of the three tones (10 high-tone stimuli, 10 mid-tone stimuli, and 10 low-tone stimuli). The materials for Experiment 1.2 will be identical to those used for Experiment 1.1.
C.1.b. Participants. In addition to the 30 Cantonese speakers for stimulus recordings, there will be four groups of 40 listeners each, for a total of 190 participants: Cantonese (40), Taiwanese (40), Mandarin (40), and English (40) listeners.
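The block structure described under C.1.a (Materials) can be illustrated with a short sketch. This is only one possible way to satisfy the stated constraints, assuming a simple Latin-square rotation of tones across blocks; the speaker labels (`F01`, `M01`, and so on) and the function name `assign_blocks` are hypothetical, and randomization of presentation order within a block is left out.

```python
from collections import Counter

TONES = ["high", "mid", "low"]

def assign_blocks(speakers):
    """Assign each speaker's three tones to three blocks by Latin-square rotation.

    In block b, the speaker at index s contributes tone (s + b) mod 3, so every
    block holds exactly one stimulus per speaker.
    """
    blocks = [[] for _ in TONES]
    for s, speaker in enumerate(speakers):
        for b in range(len(TONES)):
            blocks[b].append((speaker, TONES[(s + b) % len(TONES)]))
    return blocks

# Hypothetical speaker labels: 15 female and 15 male speakers.
speakers = [f"F{i:02d}" for i in range(1, 16)] + [f"M{i:02d}" for i in range(1, 16)]
blocks = assign_blocks(speakers)

for number, block in enumerate(blocks, start=1):
    tone_counts = Counter(tone for _, tone in block)
    print(f"Block {number}: {len(block)} stimuli, {dict(tone_counts)}")
```

Because all 30 speakers appear once in every block, the 15/15 gender balance follows automatically, and the rotation yields 10 stimuli per tone in each block.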
Half of the participants (20 per group) will participate in Experiment 1.1 and the other half will participate in Experiment 1.2. The Taiwanese and Mandarin participants will be limited to those without prior knowledge of Cantonese, and the English participants will be limited to those without prior knowledge of lexical tones. The four language groups differ in the type of lexical tone experience: Cantonese has three contrastive level tones (high, mid, and low). Taiwanese has two contrastive level tones (high and mid). Mandarin has one level tone (high) but no tonal contrasts based solely on F0 height. English does not have lexically contrastive tones. In other words, the four groups of participants will vary systematically in the type of tone experience they have. The Cantonese participants (30 speakers for stimulus recording and 40 listeners for the identification experiment) will be recruited from the student population at Hong Kong Baptist University (HKBU) with assistance from Dr. Lian Hee Wee of the Department of English Language and Literature at HKBU. Dr. Wee is the Director of the Phonetics Laboratory at HKBU and an expert in the phonetics and phonology of Chinese languages. Because Cantonese is the primary language spoken in Hong Kong, no difficulties in participant recruitment are expected. The Taiwanese participants will be recruited from the student population at National Cheng Kung University (NCKU) in Tainan, Taiwan with assistance from Dr. Jenn-Yeu Chen of the Department of Psychology/Institute of Cognitive Science at NCKU. Dr. Chen is the Director of the Language, Culture, and Cognition Laboratory at NCKU and an established scholar in psycholinguistics. The PI has a history of successful collaborations with Dr. Chen since 2009. Because NCKU is located in a predominantly Taiwanese-speaking area, no difficulties in participant recruitment are expected.
The Mandarin and English participants will be recruited from the student population at Ohio University in Athens, Ohio. There are currently over 800 native speakers of Mandarin at the Athens campus of 21,324 students. Based on the PI’s past experience of conducting language research at the University, no difficulties in participant recruitment are expected.
C.1.c. Procedure. In Experiment 1.1, the Cantonese tone identification task, Cantonese participants will be asked to identify the word they hear by pressing response keys labeled with Chinese characters on a response box. Chinese characters are the most common system for representing spoken Cantonese, and all Cantonese participants are expected to be highly familiar with the orthography. The motivation for using Chinese characters as response labels is that a word recognition task is presumably more natural and ecologically relevant to spoken language comprehension than a tone labeling task. However, the word recognition task will not be appropriate for the Taiwanese, Mandarin, and English participants because they will have no knowledge of Cantonese. Instead, these three groups of participants will receive a brief tutorial on the high-mid-low distinction among the three Cantonese tones. They will then be asked to judge the relative height of the tones intended by the speakers by pressing response keys labeled with “HIGH”, “MID”, and “LOW”. In Experiment 1.2, the speaker gender identification task, all participants will be asked to judge the gender of the speaker for each stimulus by pressing keys labeled with “FEMALE” or “MALE”.
C.1.d. Data analysis. In Experiment 1.1, arcsine-transformed accuracy of Cantonese tone identification will be compared to chance (33.3%, given the three response alternatives) with one-sample t tests to evaluate the hypothesis that listeners are able to identify tones above chance without cues typically considered necessary for speaker normalization.
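The one-sample comparisons against chance described in C.1.d can be sketched as follows. This is a minimal illustration, not the project's analysis script: it assumes the common arcsine square-root variant of the transform (the proposal does not specify which variant will be used), takes the chance level as a parameter, and uses made-up accuracy values for a two-alternative task purely to exercise the function. Only the t statistic and degrees of freedom are computed; obtaining a p value would require a t distribution from a statistics package.

```python
import math
from statistics import mean, stdev

def arcsine_transform(p):
    """Arcsine square-root transform of a proportion in [0, 1] (assumed variant)."""
    return math.asin(math.sqrt(p))

def one_sample_t(proportions, chance):
    """Return (t, df) for transformed accuracies tested against transformed chance."""
    scores = [arcsine_transform(p) for p in proportions]
    mu0 = arcsine_transform(chance)
    n = len(scores)
    t = (mean(scores) - mu0) / (stdev(scores) / math.sqrt(n))
    return t, n - 1

# Hypothetical per-listener accuracies for illustration only (not project data).
example_accuracy = [0.80, 0.75, 0.90, 0.85, 0.70]
t_stat, df = one_sample_t(example_accuracy, chance=0.5)
print(f"t({df}) = {t_stat:.2f}")
```

Transforming the chance level with the same function keeps the sample mean and the reference value on the same scale.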
Acoustic analysis will be conducted on F0, duration, and three voice quality measures: open quotient (amplitude difference between the first and second harmonics), F1 bandwidth (amplitude difference between the first harmonic and the strongest harmonic in the F1 range), and spectral tilt (amplitude difference between the first harmonic and the strongest harmonic in the F3 range). These analyses will test the hypothesis that covariations exist between F0 and voice quality measures and that such covariations could be useful for F0 height estimation. In Experiment 1.2, arcsine-transformed accuracy of speaker gender identification will be compared to chance (50%) with one-sample t tests to evaluate the hypothesis that gender detection underlies tone identification performance. In Experiments 1.1 and 1.2, analyses of variance (ANOVAs) will be conducted on arcsine-transformed accuracy and log-transformed reaction time of tone identification (1.1) and speaker gender identification (1.2) to evaluate the effects of tone language experience.
C.2. Study 2: Processing speaker variability in repetition and semantic/associative priming
Study 2 will examine the effect of speaker variability in accessing word form and meaning. Four experiments are planned, crossing two types of priming (repetition and semantic/associative priming) with two languages (English and Mandarin). Repetition priming will be used as an index of accessing word form and semantic/associative priming as an index of accessing word meaning. In repetition priming, the processing of a word (target) is facilitated when it is preceded by exactly the same word (prime). The magnitude of repetition priming is usually reduced when the prime and target do not match exactly, indicating that the prime has activated a different lexical form. If speaker information affects access to word form, the magnitude of repetition priming should be reduced when the prime and target are produced by different speakers.
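The prediction about reduced priming can be made concrete with a small sketch of how priming magnitude might be computed. The definition used here (mean log reaction time on unrelated trials minus mean log reaction time on related trials, computed separately for same-speaker and different-speaker trials) is an assumption for illustration, as are the field names and the reaction times, which are invented values and not results.

```python
import math
from statistics import mean

def priming_magnitude(trials, speaker_relation):
    """Unrelated-minus-related difference in mean log RT for one speaker relation."""
    def mean_log_rt(word_relation):
        return mean(
            math.log(t["rt_ms"])
            for t in trials
            if t["word_relation"] == word_relation
            and t["speaker_relation"] == speaker_relation
        )
    return mean_log_rt("unrelated") - mean_log_rt("related")

# Invented reaction times (ms) used only to exercise the function.
example_trials = [
    {"word_relation": "related", "speaker_relation": "same", "rt_ms": 600},
    {"word_relation": "unrelated", "speaker_relation": "same", "rt_ms": 700},
    {"word_relation": "related", "speaker_relation": "different", "rt_ms": 650},
    {"word_relation": "unrelated", "speaker_relation": "different", "rt_ms": 700},
]

same = priming_magnitude(example_trials, "same")
different = priming_magnitude(example_trials, "different")
print(f"same-speaker priming: {same:.3f}, different-speaker priming: {different:.3f}")
```

Under the hypothesis, the different-speaker value would be smaller than the same-speaker value; in the planned analyses this corresponds to an interaction between word relation and speaker relation.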
In semantic/associative priming, the processing of a target (e.g., queen) is facilitated when it is preceded by a semantically/associatively related prime (e.g., king). If speaker information affects access to word meaning, the mismatch in speaker between prime and target should result in attenuated semantic/associative priming. First, the use of both repetition and semantic/associative priming in the proposed study will allow examination of the depth of processing for the speaker variability effect. Second, given that word recognition is a fast, time-sensitive process, the time lag between the prime and target will be manipulated to examine the time course of the speaker variability effect. Third, the priming experiments will be conducted with both English and Mandarin materials to evaluate how tone language experience shapes the speaker variability effect. The following hypotheses will be tested: (1) The magnitude of repetition and semantic/associative priming will be reduced when the prime and target are produced by different speakers. (2) The reduction of repetition and semantic/associative priming will vary as a function of the interstimulus interval between the prime and target, revealing the time course of the speaker variability effect. (3) The reduction of priming by speaker variability will be greater in a tone language because of the extensive use of F0 in signifying lexical distinctions, resulting in elevated sensitivity to speaker characteristics in tone language listeners.
C.2.a. Materials. English materials will be used in Experiment 3.1 (repetition priming) and 3.2 (semantic/associative priming), and Mandarin materials will be used in Experiment 3.3 (repetition priming) and 3.4 (semantic/associative priming). In the repetition priming experiments, real word and nonword targets will be paired with four primes varying in word relation (repetition or unrelated) and speaker relation (same or different), as illustrated in Table 1 below.
For the English materials, four speakers (two female and two male) of American English from the same town in central Ohio will record the stimuli to avoid a potential confound of dialect differences. To minimize potential long-term priming effects from repetition of targets by the same speaker, each target will be produced by a given speaker only once during the experiment. On the other hand, to ensure all prime-target conditions are balanced across all speakers, there will be 20 word targets and 20 nonword targets such that each speaker is assigned to a given condition five times. In sum, each of the 40 targets will be paired with four primes for a total of 160 trials.
Table 1: An example of the repetition priming setup
Prime (speaker)    Target (speaker)    Word relation    Speaker relation
queen (1)          queen (1)           repetition       same
queen (4)          queen (2)           repetition       different
bell (3)           queen (3)           unrelated        same
bell (2)           queen (4)           unrelated        different
The design of the semantic/associative priming experiments is identical to that of the repetition priming experiments except that the prime-target relationship is semantic/associative instead of form-based. In particular, real word and nonword targets will be paired with four primes varying in word relation (semantic/associative or unrelated) and speaker relation (same or different), as illustrated in Table 2 below. The recordings will be made by the same speakers as in the repetition priming experiment, and the same considerations apply regarding no repetition of targets and balanced pairing between speakers and prime-target conditions. As in the repetition priming experiment, each of the 40 targets will be paired with four primes for a total of 160 trials.
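The counterbalancing constraints stated for the priming materials (each target produced by a given speaker only once, and each speaker assigned to each condition five times per set of 20 targets) can be met in several ways. The sketch below shows one possibility, assuming a Latin-square rotation of target speakers across conditions. The item labels are placeholders, and the rule for choosing the prime speaker on different-speaker trials (a shift of two positions in the speaker list) is an arbitrary assumption, since the proposal does not specify it.

```python
from collections import Counter

SPEAKERS = [1, 2, 3, 4]
CONDITIONS = [
    ("repetition", "same"),
    ("repetition", "different"),
    ("unrelated", "same"),
    ("unrelated", "different"),
]

def build_trials(targets, unrelated_primes):
    """Pair every target with four primes, rotating target speakers across conditions."""
    n = len(SPEAKERS)
    trials = []
    for i, target in enumerate(targets):
        for c, (word_relation, speaker_relation) in enumerate(CONDITIONS):
            target_speaker = SPEAKERS[(i + c) % n]
            # Assumed rule: on different-speaker trials the prime speaker is
            # two positions away from the target speaker in the speaker list.
            shift = 0 if speaker_relation == "same" else 2
            prime_speaker = SPEAKERS[(i + c + shift) % n]
            prime = target if word_relation == "repetition" else unrelated_primes[i]
            trials.append({
                "prime": prime,
                "prime_speaker": prime_speaker,
                "target": target,
                "target_speaker": target_speaker,
                "word_relation": word_relation,
                "speaker_relation": speaker_relation,
            })
    return trials

# Placeholder labels standing in for the actual word and nonword lists.
targets = [f"word{i:02d}" for i in range(20)] + [f"nonword{i:02d}" for i in range(20)]
unrelated = [f"prime{i:02d}" for i in range(40)]
trials = build_trials(targets, unrelated)

word_trials = [t for t in trials if t["target"].startswith("word")]
balance = Counter(
    (t["word_relation"], t["speaker_relation"], t["target_speaker"]) for t in word_trials
)
print(len(trials), set(balance.values()))
```

The same rotation applies to the semantic/associative experiments by replacing the repetition prime with a related prime.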
Table 2: An example of the semantic/associative priming setup
Prime (speaker)    Target (speaker)    Word relation    Speaker relation
king (2)           queen (2)           semantic         same
king (1)           queen (3)           semantic         different
bell (4)           queen (4)           unrelated        same
bell (3)           queen (1)           unrelated        different
The Mandarin materials will be adapted from the PI’s previous work on the effect of processing lexical tone in form and mediated priming (Lee, 2007). As in the English experiments, 20 word targets and 20 nonword targets will be selected. In both repetition and semantic/associative priming tasks, each target will be paired with four primes that are related or unrelated to the target. There will be 160 trials in each experiment. Four speakers of Beijing Mandarin will record the stimuli. The assignment of speakers to prime-target conditions will be identical to that in the English experiments. Two interstimulus intervals (50 ms and 250 ms), as were used in the PI’s pilot work (Lee & Zhang, submitted), will be used to explore the time course of the priming effects. To avoid potential long-term priming effects due to repetition of targets, interstimulus interval will be a between-subject factor such that participants are not exposed to the same set of stimuli excessively.
C.2.b. Participants. There will be 40 participants in each of the four experiments (20 for each interstimulus interval) for a total of 160 participants. Half will be English speakers and the other half will be Mandarin speakers.
C.2.c. Procedure. In all four experiments, participants will be randomly assigned in equal numbers to one of the two interstimulus interval conditions. Each participant will receive a uniquely randomized order of stimulus presentation. Participants will be instructed to judge whether a target is a real word in the English/Mandarin language or not by pressing keys labeled with “WORD” and “NONWORD”.
C.2.d. Data analysis.
In all experiments, ANOVAs will be conducted on arcsine-transformed response accuracy and log-transformed reaction time with word relation (repetition/semantic, unrelated) and speaker relation (same, different) as within-subject variables, interstimulus interval (50 ms, 250 ms) as a between-subject variable, and participants and items as random variables. These analyses will evaluate the effect of speaker variability on accessing word form and meaning, and the time course of the speaker effect. The effect of tone language experience will be evaluated by statistically comparing the response patterns between Experiments 3.1/3.2 and Experiments 3.3/3.4.
C.3. Timetable, arranged by experiment and number of participants to be tested
[Timetable: for each study (Study 1 and Study 2), the rows list stimulus preparation, data collection, data analysis, and results dissemination; the columns are the fall, spring, and summer terms of Years 1 through 3. Cell entries give the experiment numbers (1.1-2, 3.1-2, 3.3-4, 3.1-4) and, in parentheses, the number of participants to be tested (30 or 40).]
Notes. F: fall semester; Sp: spring semester; Su: summer.
D. Broader Impacts
D.1. Advance discovery and understanding while promoting teaching, training, and learning
This proposed project will provide extensive research training to a postdoctoral researcher and several graduate and undergraduate students aspiring to pursue research and teaching careers. The trainees will be involved in all phases of the project to obtain knowledge and skills for conducting research. The students will also be able to integrate their coursework in speech acoustics, speech perception, and psycholinguistics into many aspects of the project. The undergraduate students will particularly benefit from the research experience.
As the instructor of two undergraduate core courses with an average registration of 60 students per class, the PI constantly receives requests from undergraduate students wishing to participate in lab activities. In the last two years, the PI has worked with six undergraduate students, who participated in stimulus preparation, data collection, and regular lab meetings. Two students received the competitive Provost’s Undergraduate Research Fund at Ohio University to conduct research, and all three graduating seniors successfully entered graduate programs in speech-language pathology or audiology. Support from the grant will enable the PI to identify and engage select undergraduate students in integrating basic science research into their education.
D.2. Broaden the participation of underrepresented groups
Due to the crosslinguistic nature of this project, the postdoctoral researcher and graduate students to be recruited for the project are expected to be fluent in Cantonese, Mandarin, and/or Taiwanese. This project will also allow a substantial number of speakers of the three languages to participate in speech and language research. The trainees and participants will make a significant contribution to this project on speech perception and spoken word recognition, where these non-English languages are relatively underrepresented.
D.3. Enhance the infrastructure for research and education
The proposed project involves topics traditionally studied in distinct disciplines including linguistics, cognitive psychology, and speech and hearing sciences. Support from the grant will significantly enhance the PI’s effort in maintaining existing collaborations and identifying new collaborations to achieve the PI’s long-term goal of understanding the processing of linguistic and nonlinguistic prosody by listeners with various characteristics.
To that end, the PI has established collaborations with linguists, language teachers, psycholinguists, clinical psychologists, and cognitive ethnomusicologists at Ohio University and beyond. For example, the PI has successfully collaborated with colleagues in Linguistics and the Chinese language program at Ohio University (L. Tao and Z. S. Bond) on a series of studies on the effects of acoustic variability on native and nonnative tone perception. Since 2008, this collaboration has resulted in four published articles and two manuscripts under review, all in high-impact journals. The rapid growth in the number of Mandarin-speaking students and English-speaking students enrolling in Chinese language courses at Ohio University has allowed us to conduct these crosslinguistic studies on lexical tone perception. Support from the grant will significantly enhance the PI’s ability to extend this knowledge network to participating international institutions.
D.4. Broad dissemination to enhance scientific and technological understanding
Findings from this project will be disseminated to relevant scientific communities (acoustics, linguistics, cognitive psychology, and language teaching) through journal publications, conference presentations, and presentations at the PI’s institution and other institutions. All of the PI’s previous studies have been successfully published in high-impact journals such as the Journal of the Acoustical Society of America, Journal of Phonetics, Language and Speech, and Speech Communication. Disseminating findings through these outlets is expected to reach a wide audience including scientists, language teachers, and graduate and undergraduate students.
D.5. Benefits to society
Due to the basic science nature of the proposed project, direct benefits to society will be limited. However, knowledge of how speaker variability is processed in tone perception may be used as a knowledge base to develop effective means for teaching lexical tones.
For English speakers learning tone languages, lexical tones are one of the most difficult aspects to master. Identifying how tone language experience affects the processing of speaker variability in tone perception and word recognition can potentially contribute to improving and strengthening instruction of tone languages. Current approaches to tone language instruction emphasize using idealized speech materials that are clearly articulated and produced by a minimal number of speakers. However, research in speech perception has shown that human listeners are quite adept at compensating for acoustic variability and that sources of variability such as multiple speakers could in fact contribute to robust acquisition of foreign sound contrasts. Therefore, it will be helpful to incorporate into instruction speech materials that are acoustically more challenging (e.g., produced by multiple speakers) to facilitate the transition from idealized materials to connected, conversational speech.