Project Description
A. Objectives and Significance
The broad goal of the proposed project is to investigate how speaker variability affects
speech perception. To that end, this project will focus on the perception of lexical tones. In tone
languages, lexical tones are functionally equivalent to consonants and vowels. The primary
acoustic correlate of lexical tone is fundamental frequency (F0). Because F0 range varies across
speakers, a phonologically high tone produced by one speaker could be acoustically equivalent to
a phonologically low tone produced by another speaker. Conversely, a given tone produced by
two speakers could be acoustically distinct. How do listeners process such acoustic variability
across speakers in speech perception? Research on speaker variability has traditionally focused
on vowel perception in the English language. How speaker variability affects the perception of
suprasegmental features in non-English languages is largely unknown. Given that tone languages
constitute the majority of the world’s languages (Laver, 1994) and that lexical tones employ
acoustic cues that are distinct from those for segmental phonemes, the expected significance of
the proposed project is to extend current knowledge about how listeners process speaker
variability to suprasegmental features of speech. With a crosslinguistic approach and established
research paradigms, the proposed project is expected to contribute significantly to our
understanding of this foundational issue in speech perception.
Objective 1 of the proposed project is to examine the acoustic and perceptual basis of
listeners’ ability to estimate F0 height without cues typically present for speaker normalization.
Previous research showed that listeners use dynamic F0 contour, external context, and familiarity
with speakers to accomplish multispeaker tone perception (Leather, 1983; Moore & Jongman,
1997; Wong & Diehl, 2003). However, recent evidence indicates that relative F0 height in
multispeaker speech can be identified without these cues (Bishop & Keating, 2010; Honorof &
Whalen, 2005; Lee, 2009; Lee, Lee, & Shr, 2011). Acoustic data from Lee’s (2009) Mandarin
study suggest that listeners may take advantage of covariations between F0 and voice quality
measures to judge the relative F0 height of tones across speakers. Perceptual data further suggest
that detection of speaker gender mediates multispeaker Mandarin tone identification (Lee,
Dutton, & Ram, 2010). However, studies using non-tone language (English) materials did not
reveal a strong contribution of voice quality (Bishop & Keating, 2010) or gender identification
(Honorof & Whalen, 2010), raising the possibility that tone and non-tone language listeners use
distinct strategies in F0 height estimation. In addition, because none of the Mandarin tonal
contrasts involve F0 height alone, it is not known whether the F0 estimation ability also
generalizes to tone languages with intrinsic level-tone contrasts.
The answers to these questions have important implications for speech perception theories.
First, if listeners of a level-tone language can estimate F0 height reliably without speaker
normalization cues, the finding will challenge the long-held assumption that those cues are
necessary to process speaker variability in tone perception. Consequently, identifying the
acoustic and perceptual basis for the F0 estimation ability will significantly advance our
knowledge about processing speaker variability beyond segmental phonemes. Second,
investigating F0 height estimation by listeners with and without tone language experience will
elucidate how language experience shapes the ability to process speaker variability. In Study 1,
Cantonese level tones that contrast only in F0 height will be recorded by multiple speakers. The
multispeaker tones will be presented in isolation to listeners without prior exposure to the
speakers. To evaluate the effect of tone language experience, four groups of listeners will be
used: Cantonese, Taiwanese (a tone language with level-tone contrasts), Mandarin (a tone
language without level-tone contrasts), and English (a non-tone language). Four hypotheses will
be tested: (1) Accuracy of F0 height judgment will be obtained to test the hypothesis that
listeners are able to identify multispeaker level tones without cues typically present for speaker
normalization. (2) Acoustic analysis will be conducted on the tone stimuli to test the hypothesis
that covariations exist between F0 and voice quality measures as the basis for the F0 estimation
ability. (3) Accuracy of speaker gender judgment will be obtained to test the hypothesis that
gender detection contributes to tone identification. (4) Identification performance will be
compared among the listener groups to test the hypothesis that F0 estimation ability varies as a
function of tone language experience.
Objective 2 of the project is to investigate the impact of speaker variability on accessing the
form and meaning of spoken words. Although speech perception research has traditionally
focused on the identification and discrimination of speech sounds, there has been increasing
interest in the study of lexical processing, i.e., the mapping of sound onto the mental lexicon.
Understanding the nature of lexical processing is important because the ultimate goal of speech
perception is to access words in the mental lexicon. Traditionally, lexical representations are
assumed to be abstract phonological codes storing only lexically contrastive information. Under
this assumption, the process of recognizing spoken words entails discarding surface acoustic
variability (e.g., speaker-specific information) in order to arrive at abstract phonological codes.
However, there is emerging evidence that surface acoustic variability is encoded in lexical
representations, suggesting that mapping from acoustic signals onto the lexicon does not
necessarily involve reduction of acoustic variability (Pisoni, 1997). Although this conclusion is
supported by many psycholinguistic studies on lexical representations, there is little evidence on
the effect of speaker variability on lexical processing. In addition, it is not known whether the
conclusions drawn from non-tone language studies would generalize to tone languages. Because
F0 is the primary acoustic cue for identifying both lexical tones and speaker characteristics, the
effect of speaker variability on lexical processing may depend on the role of F0 in a language.
The answers to these questions will inform models of spoken word recognition and elucidate the
role of suprasegmental features in lexical processing.
Study 2 will address these questions by examining the effects of speaker variability on
repetition priming (Forster & Davis, 1984) and semantic/associative priming (Meyer &
Schvaneveldt, 1976). These two types of priming are well-established cognitive phenomena that
have been shown to effectively reveal the process of accessing the form and meaning of words.
In this study, prime-target pairs that vary systematically in form/meaning relationship and
speaker relationship will be presented, and the accuracy and reaction time of listeners’ responses
to the targets will be analyzed to test the following hypotheses: (1) The magnitude of repetition
and semantic/associative priming will be reduced when the prime and target are produced by
different speakers. (2) The reduction of repetition and semantic/associative priming will vary as a
function of the time lag between the prime and target, revealing the time course of the speaker
variability effect. (3) The reduction of priming by speaker variability will be greater in a tone
language because of the extensive use of F0 for lexical distinctions, resulting in elevated
sensitivity to speaker characteristics in tone language listeners. These results will be compared to
the PI’s pilot work using English materials (Lee & Zhang, submitted). With the crosslinguistic
approach and a well-established research paradigm, this study is expected to significantly
advance current knowledge of the effect of speaker variability on lexical processing.
In sum, the significance of the proposed project is that it addresses a foundational issue in
speech perception—how listeners process speaker variability in acoustic signals—by employing
a crosslinguistic approach and well-established paradigms in speech perception, speech acoustics,
and lexical processing research. With a focus on suprasegmental features of speech, this project
is expected to significantly extend current knowledge on processing speaker variability beyond the
segmental structure of speech. The two proposed studies are firmly grounded in the PI’s
published research and pilot work. Both studies are also thematically related to the long-term
goals of the PI’s research program in understanding the processing of linguistic and nonlinguistic
prosody by listeners with various characteristics.
B. Background and Preliminary Studies
Spoken language comprehension involves mapping acoustic signals onto linguistic
representations. Two fundamental questions in the study of spoken language comprehension are
the nature of sound/lexical representations and how listeners retrieve the representations from
acoustic signals. Acoustic-phonetic research has shown that phonologically identical utterances
can vary significantly across speakers. Despite the acoustic variability, listeners are able to
understand sounds and words spoken by different speakers. How listeners achieve the perceptual
constancy in the face of speaker variability is a foundational issue in speech perception research
(Johnson, 2005). Emerging evidence in speech perception research has challenged traditional
assumptions about the abstractness of lexical representations (Pisoni, 1997). Although an
abundance of research has been devoted to understanding speaker variability in processing
segmental phonemes, research on processing speaker variability for suprasegmental features is
relatively scarce. This is most likely due to the fact that suprasegmental features do not play a
prominent role in phonemic distinctions in English (Cutler, 1997). However, suprasegmental
features are an integral part of speech and are closely associated with speaker characteristics.
Therefore, processing speaker variability for suprasegmental features is an essential part of
speech perception, particularly in languages with extensive use of lexically contrastive prosody.
B.1. Perception of relative F0 height from multispeaker tones
F0 range varies across speakers. A phonologically high tone produced by a male speaker
could be acoustically equivalent to a phonologically low tone produced by a female speaker. On
the other hand, a given tone produced by two speakers could be acoustically distinct. Intuitively,
judgment of the relative F0 height of a tone intended by a speaker has to be made with reference
to the speaker’s F0 range. This observation is supported by research showing that tone
perception is contingent on the perceived F0 range of a speaker. Leather (1983) examined
identification of Mandarin tones that were synthesized to be lexically ambiguous. The tone
stimuli were presented in carrier phrases produced by two speakers. The results showed that
stimuli with identical absolute F0 contours were identified as different tones depending on which
speaker was heard, indicating the use of perceived range information in tone perception. This
finding was replicated by Moore and Jongman (1997), who showed that Mandarin tone stimuli
with identical F0 patterns were perceived as high tones in a low F0 carrier phrase produced by
one speaker, but as low tones in a high F0 carrier phrase produced by another speaker. Wong and
Diehl (2003) examined identification of Cantonese level tones embedded in carrier phrases
produced by seven speakers. The results showed that the same target tones were identified
differently depending on which carrier phrase was used.
It is clear from these studies that context provides important information about speaker F0
range. Listeners can use contextual information to interpret tones just as they do to interpret
vowels (e.g., Ladefoged & Broadbent, 1957). Therefore, removing context should make it
difficult to estimate speaker F0 range. The absence of context should particularly compromise
identification of level tones, whose contrasts rely solely on relative F0 height. Wong and Diehl
(2003) examined identification of three Cantonese level tones that were produced by seven
speakers and presented in isolation. The results showed that tone identification was more
accurate when the stimuli were blocked by speaker (80%) than when they were mixed across
speakers (49%). As expected, identification performance was compromised by the mixed-speaker stimuli (Creelman, 1957; Zhou, Zhang, Lee, & Xu, 2008). However, the fact that
identification accuracy still exceeded chance (33%) in both conditions indicates that the absence
of context does not make F0 height judgment impossible. It also indicates that there are syllable-internal cues to relative F0 height. However, because listeners in the experiment heard the
stimulus set 12 times, they could have learned to estimate the F0 range of the speakers through
repeated exposure.
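The above-chance comparison can be illustrated with a one-sided exact binomial test. The following is a minimal sketch, not the analysis used by Wong and Diehl (2003): only the 49% accuracy and the 33% chance level come from the text above, while the trial count and the function name are hypothetical values chosen for illustration.

```python
from math import comb


def binomial_p_above_chance(n_correct, n_trials, chance):
    """One-sided exact binomial p-value: P(X >= n_correct) under the chance rate."""
    return sum(
        comb(n_trials, k) * chance**k * (1 - chance) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )


# Hypothetical example: the number of trials is assumed, not taken from the study.
n_trials = 100      # assumed trial count
n_correct = 49      # mirrors the 49% mixed-speaker accuracy
chance = 1 / 3      # three level tones, so one in three by guessing

p_value = binomial_p_above_chance(n_correct, n_trials, chance)
print(f"p = {p_value:.5f}")
```

A small p-value under these assumed counts would indicate that accuracy of this size is unlikely to arise from guessing among three tones.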
To rule out the familiarity account, the PI (Lee, 2009) recorded Mandarin sa syllables with
four tones produced by 16 female and 16 male speakers. The syllables were digitally processed
such that only the fricative and first six glottal periods of the vowel remained, effectively
neutralizing F0 contour contrasts among the tones. These multispeaker, level-F0 stimuli (i.e., no
F0 contour cues) were presented in isolation (i.e., no contextual cues) with each stimulus being
presented just once (i.e., no familiarity cues). Despite the absence of those cues typically
considered necessary for speaker normalization, listeners were able to identify the intended tones
with above-chance accuracy. This finding was replicated by Lee and Lee (2010). Lee’s (2009)
acoustic analyses further revealed contrasts between the high- and low-onset tones in F0,
duration, and two voice quality measures (F1 bandwidth and spectral tilt). Correlation analyses
also showed that F0 covaried with the voice quality measures and that tone classification based
on F0 height correlated with the voice quality measures. Because the same acoustic measures
consistently distinguished female from male stimuli, Lee (2009) proposed that speaker gender
detection may be the basis for the F0 height judgment performance.
The PI and colleagues (Lee et al., 2010) evaluated this proposal by asking listeners to judge
speaker gender from the same set of stimuli used in Lee (2009). The results showed that gender
identification accuracy was above chance, suggesting that the ability to judge F0 height from
these stimuli is likely due to successful identification of speaker gender as a precursor.
Specifically, listeners identify speaker gender based on voice quality and then exploit the
covariation between F0 and voice quality for relative F0 height estimation. Once the gender
decision is made, pitch class templates stored in memory that are gender-specific can be invoked
to compare to the stimulus. Listeners may calibrate their judgments according to the templates,
which reflect typical F0s for female and male speakers that listeners have experienced
throughout their lives. It has been noted that pitch class templates can be acquired from exposure
to prevalent speaking F0s of a linguistic community (Dolson, 1994). If so, F0 height of a tone
stimulus could be inferred with the templates as a reference frame.
The ability to estimate relative F0 height without speaker normalization cues has also been
reported for non-tone language listeners. Honorof and Whalen (2005) showed that English
listeners were able to locate an F0 reliably within a speaker’s F0 range without context or prior
exposure to a speaker’s voice. Isolated vowel tokens, produced by 20 English speakers with
varying F0s, were presented to listeners to judge where each token was located in the speakers’
F0 ranges. The results showed significant correlations between the perceived F0 location and the
actual location in the speakers’ F0 ranges, indicating that the listeners were able to estimate
relative F0 height. It was speculated that covariation between F0 and voice quality might have
contributed to the identification performance, although this hypothesis was not directly tested in
the study.
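The notion of locating an F0 within a speaker's range, and of correlating perceived with actual locations, can be sketched as follows. This is an illustrative sketch only: the linear Hz scaling, the function names, and all numeric values are assumptions, not the procedure or data of Honorof and Whalen (2005).

```python
from math import sqrt


def relative_f0_position(f0_hz, range_min_hz, range_max_hz):
    """Position of an F0 value within a speaker's range: 0 = floor, 1 = ceiling."""
    return (f0_hz - range_min_hz) / (range_max_hz - range_min_hz)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# The same absolute F0 can sit at opposite ends of two hypothetical speakers' ranges.
low_voice = relative_f0_position(200.0, 100.0, 200.0)    # ceiling of a low range
high_voice = relative_f0_position(200.0, 200.0, 400.0)   # floor of a high range

# Hypothetical actual and perceived range locations for five tokens.
actual = [0.0, 0.25, 0.5, 0.75, 1.0]
perceived = [0.1, 0.3, 0.4, 0.8, 0.9]
r = pearson_r(actual, perceived)
```

In this sketch, a high positive `r` corresponds to listeners placing tokens close to their actual positions in the speakers' ranges.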
The potential role of voice quality in the F0 estimation ability was evaluated by Bishop and
Keating (2010). Their first experiment replicated Honorof and Whalen’s (2005) finding that
listeners’ perceived F0 locations correlated with speakers’ actual F0 locations, confirming that
the listeners were able to estimate relative F0 height. Statistical modeling showed that F0 is the
single most important predictor for the F0 estimation performance. By contrast, acoustic
measures of voice quality contributed only minimally to the F0 estimation ability. The second
experiment showed that listeners were able to identify speaker gender from the same set of
stimuli. Statistical modeling again showed that F0 is the most important predictor for the gender
identification performance. In contrast to the first experiment, voice quality contributed to the
gender identification performance to a greater extent than it did to F0 height estimation. It was
concluded that listeners form expectations about F0s for average male and female speakers
through experience, and that they rely on absolute F0 to determine speaker gender, which in turn
contributes to relative F0 estimation. That is, voice quality contributes to F0 estimation only
indirectly through gender identification.
The contribution of gender identification to relative F0 estimation was questioned by
Honorof and Whalen (2010). Listeners were asked to judge speaker gender from isolated vowels
spoken by 20 English male and female speakers with overlapping F0s. The listeners performed
above chance overall, but showed a bias toward hearing high F0s as female and low F0s as male
when stimulus F0s were near range extremes. There was no strong evidence for a contribution of
voice quality, weakening the argument that voice quality is used to identify speaker gender. The
authors proposed that the gender identification results are best explained by the listeners’ primary
reliance on absolute F0 and secondary reliance on formants or vocal tract information.
Findings from these English-based studies provided important information about processing
speaker variability in nonlinguistic F0 distinctions. However, it is not clear whether these
conclusions could generalize to processing speaker variability in tone languages, in which F0
distinctions are lexically contrastive. Although the PI’s work on Mandarin tones (Lee, 2009; Lee
et al., 2010) provided preliminary evidence for the role of voice quality and gender identification
in relative F0 estimation, the evidence is weakened by the fact that Mandarin tones do not
involve contrasts that rely solely on F0 height. Three of the four Mandarin tones are contour
tones. Therefore, listeners can use the contours to infer F0 range and do not need to rely on F0
height to identify tonal contrasts. The most stringent test of the ability to estimate relative F0
height without speaker normalization cues will be to use a tone language that has intrinsic level-tone contrasts.
The PI’s recent study on Taiwanese (Lee et al., 2011) provides preliminary evidence.
Taiwanese is a tone language with seven lexical tones, two of which are level tones (high-level
and mid-level) contrasting only in F0 height. In the study, the two level tones, produced by 30
male and female speakers, were presented in isolation. Musically-trained listeners were able to
identify the tones with above-chance accuracy, indicating that they were able to estimate relative
F0 height without cues typically considered necessary for speaker normalization. What remains
unclear is whether listeners are able to estimate F0 height in languages with multiple level-tone
contrasts. It is also not known how tone language experience would shape the ability.
Study 1 of the proposed project will use Cantonese to further investigate the acoustic and
perceptual basis of listeners’ ability to estimate relative F0 height from multispeaker tones. This
study is a meaningful and substantial extension of the aforementioned literature in two ways.
First, Cantonese has a rich inventory of lexical tones, three of which are level tones contrasting
in F0 height only (high-, mid-, and low-level). The multiple tonal contrasts provide ideal testing
materials for the proposed research question. Second, the effect of tone language experience on
the ability to estimate F0 height will be systematically investigated by using listeners of
Cantonese, Taiwanese (a tone language with a level-tone contrast), Mandarin (a tone language
without level-tone contrasts), and English (a non-tone language). Findings from the proposed
study are expected to significantly advance our knowledge about how speaker variability
exhibited in F0 height is processed in speech perception.
B.2. Processing speaker variability in accessing word form and meaning
The ultimate goal of speech perception is to map acoustic signals onto linguistic
representations. Because words/morphemes are the smallest meaningful units of speech, a
fundamental question in speech perception is how listeners extract information from the acoustic
signals to access words. It is typically assumed that lexical representations are abstract and
contain only lexically contrastive information, implying that surface acoustic variability not
directly relevant to lexical identity is discarded during the signal-to-representation mapping
process. By this account, speaker-specific information should not be part of the abstract
phonological code stored in lexical representations. Consequently, memory for words should not
be influenced by variability across speakers. This prediction is supported by Jackson and Morton
(1984), who used a long-term priming paradigm to examine word recognition in a “test” phase
following a “study” phase. In the study phase, participants listened to words spoken by a female
or a male speaker and made semantic judgments about the words. In the test phase, participants
listened to the same set of words that were produced by either the same speaker or different
speakers. The results showed no differences in word recognition accuracy between words spoken
by the same speaker and different speakers. Schacter and Church (1992) asked listeners in the
study phase to judge the pitch of the speakers from words spoken by six speakers. In the test
phase, listeners were asked to identify words that were spoken by the same or different speakers.
Although the study phase was intended to focus the listeners’ attention on speaker-specific
information, word recognition accuracy was similar under both the same- and different-speaker
conditions, suggesting that speaker variability does not affect word recognition.
By contrast, other studies employing the long-term priming paradigm showed that memory
for words contains detailed information about speakers and that this information affects word
recognition. A speaker effect emerged when Schacter and Church (1992) changed their task in
the test phase (from word identification to stem completion) and presented the test stimuli in
quiet (instead of in noise). Listeners were asked to propose a multisyllabic word based on
auditory input of the first syllable of a word. This “stem completion” was more accurate when
the stimuli in the test phase were produced by the same speaker as in the study phase. Church
and Schacter (1994) asked listeners to identify words that were low-pass filtered to preserve F0
information, but not formant information. The results showed that word identification was more
accurate when the stimuli were produced by the same speaker as in the study phase. In addition,
identification accuracy was also affected by within-speaker F0 variation. Specifically, when the
same speaker was used in the study and test phases, word identification was more accurate when
the F0 was the same at study and test, suggesting that detailed F0 information is encoded
in memory for words.
Subsequent studies identified additional factors that contribute to the speaker variability
effect in long-term priming. Goldinger (1996) showed that word recognition in noise was more
accurate when both study and test words were produced by the same speaker. Importantly, the
same-speaker advantage in word recognition was observed only when words were encoded at
relatively shallow levels of processing (i.e., gender and phoneme classification). By contrast, the
speaker effect was attenuated when listeners were asked to encode words at a relatively deep
level of processing (i.e., syntactic classification).
There is also evidence that the speaker effect is contingent on response format. Luce and
Lyons (1998) examined the effect of changing speaker on auditory lexical decision (probing
implicit memory) and on old/new judgment (probing explicit memory). In the study phase,
listeners made lexical decisions on word and nonword stimuli. In the test phase, listeners either
performed lexical decisions again or judged whether each stimulus had been presented in the
study phase. The results showed that responses were faster to words that were produced by the
same speaker in old/new judgment, but not in lexical decision, indicating that speaker variability
affected explicit memory, but not implicit memory. Luce and Lyons (1998) proposed that the
speaker effect did not emerge in lexical decision because lexical decision requires rapid
responses, whereas the speaker effect takes more time to develop because speaker-specific
details in lexical representations are not available as early as the more abstract underlying forms.
In other words, when processing is fast, the speaker effect is attenuated. As processing unfolds
over time, the speaker effect becomes more pronounced.
This time-course hypothesis was further tested in McLennan and Luce (2005). In three long-term priming experiments, the authors manipulated ease of word/nonword discrimination (easy
vs. difficult) and response format (speeded vs. delayed response). These manipulations required
listeners to process words at different speeds (i.e., difficult distinctions and the delayed response
format would result in slower processing). As predicted, the speaker effect was found only when
discrimination was difficult and in the delayed response format, suggesting that processing
speaker-specific information takes time. Mattys and Liss (2008) showed that the speaker effect
on long-term priming was more pronounced in stimuli produced by dysarthric speakers than
those produced by normal control speakers. Because processing dysarthric speech is more
difficult and presumably requires more time than processing non-dysarthric speech, this result is
consistent with the idea that processing speaker-specific information requires time. Vitevitch and
Donoso (2011) showed in a lexical decision task that it was easier for listeners to detect a
speaker change when nonword stimuli were word-like than when the stimuli were less word-like.
Because the word-like stimuli were more difficult to process, they required more processing time
than less word-like stimuli. Therefore, the better detection of speaker change in the word-like
stimuli suggests that speaker-specific information requires more time to process.
Although these long-term priming studies provided evidence for the richness of lexical
representations, it is not clear to what extent such a task can reveal the word recognition process
itself (Luce & Lyons, 1998). In particular, the process of spoken word recognition is usually
characterized by the activation of multiple word candidates and competition among the
candidates (Luce & McLennan, 2005). In addition, lexical activation and competition is a time-sensitive, “on-line” process that has been extensively examined with various research paradigms
(Grosjean & Frauenfelder, 1996). Few studies, however, have examined the role of speaker
variability in these short-term lexical processes. An eye-tracking study by Creel, Aslin, and
Tanenhaus (2008) demonstrated the on-line use of speaker information in lexical disambiguation.
The stimuli included pairs of words spoken by the same speaker or different speakers. Eye-tracking results showed fewer fixations on competitors for words from the different-speaker pairs
than for words from the same-speaker pairs, indicating that speaker information facilitated
disambiguation of lexical competitors. A similar response pattern was found in a second
experiment, in which listeners learned to identify visual shapes from novel labels spoken by the
same speaker or different speakers. There were more fixations on words with same-speaker
cohorts, but fewer fixations on words with same-speaker competitors. These results indicate that
listeners are able to use speaker-specific information encoded in lexical representations during
lexical activation and competition.
Short-term paired priming is a paradigm that has been used extensively to study lexical
processing (see Zwitserlood, 1996, for a summary), but has not been applied to the study of
speaker variability. In this paradigm, a prime and a target that are related in certain ways (e.g.,
phonologically) are presented. The accuracy and latency of responses to the target are used to
evaluate the effects of the prime-target relationship, thereby revealing the organization of the
lexicon and the processes of accessing lexical representations. Of particular relevance to the
proposed study, short-term priming has been used to examine the effects of acoustic variability
on spoken word recognition and the time course of the effects. Andruski, Blumstein, and Burton
(1994) used short-term paired priming to investigate the effect of subphonetic variability on
lexical access. Prime-target pairs that varied in semantic/associative relationship (e.g., king-queen) were used as stimuli, and participants were asked to make lexical decisions on the targets.
The results showed that subtle voice onset time (VOT) differences, which did not change the
perception of voicing categories, affected the magnitude of semantic/associative priming,
suggesting that detailed acoustic information was not discarded during the recognition process.
Rather, subphonetic information affected access to word meaning. In addition, the effect of
subphonetic variability appeared at a relatively short lag (50 ms), but not at a long lag (250 ms)
between prime and target, indicating that the impact of subphonetic variability dissipated quickly.
Andruski et al.’s (1994) findings are significant because they demonstrated that acoustic
variability that does not alter word identity could still affect lexical processing. The study also
provided evidence for the time course of the subphonetic variability effect. Given the multiple
sources of acoustic variability in the speech signal, one might ask whether other sources of
acoustic variability, such as speaker variability, would affect lexical processing in a similar way.
Using the short-term paired priming paradigm to investigate speaker variability, the following
questions could be answered: Does speaker variability affect access to word meaning as
subphonetic variability does? Does speaker variability affect access to word form in addition to
word meaning? Are the magnitude and time course of the speaker variability effect comparable to
those of the subphonetic variability effect? The answers to these questions will further contribute
to our understanding of the role of acoustic variability in speech perception.
The PI’s pilot work using English materials (Lee & Zhang, submitted) provided preliminary
answers to these questions. The effect of speaker variability on accessing the form and meaning
of spoken words was evaluated in two short-term paired priming experiments. In the repetition
priming experiment, participants listened to repeated or unrelated prime-target pairs, in which the
prime and target were produced by the same speaker or different speakers. The results showed
that the magnitude of repetition priming was reduced when the prime and target were produced
by different speakers, indicating that speaker variability affected access to word form. In the
semantic/associative priming experiment, participants listened to semantically/associatively
related or unrelated prime-target pairs, in which the prime and target were produced by the same
speaker or different speakers. The results showed that the magnitude of semantic/associative
priming was reduced in different-speaker trials, but only for targets produced by one of the
speakers. There was no evidence that the speaker variability effect varied as a function of the
interstimulus interval between the prime and target (50 ms and 250 ms). These findings suggest
that speaker variability affects spoken word recognition, but primarily at a relatively shallow
level of processing.
Findings from this pilot study are encouraging. First, the study showed that the short-term
paired priming paradigm, which has been used extensively in lexical processing research, is also
effective in revealing the speaker variability effect. Second, the study showed distinct response
patterns between repetition and semantic/associative priming, indicating that speaker variability
affects access to word form and meaning differently. Third, there was some indication that the
time course of the speaker variability effect differed between repetition and semantic/associative
priming, suggesting that there may be distinct time courses between accessing word form and
meaning. These positive findings are worthy of further exploration.
However, only two speakers (one female and one male) were used in the pilot study and
gender was a confounding variable. It is not clear whether the modest number of speakers used
in the stimuli was effective in generating a sufficient range of speaker variability for the
listeners. Moreover, it is not known whether the conclusions drawn from non-tone language
studies would generalize to tone languages. Because F0 is the primary acoustic cue for
identifying lexical tones and speaker characteristics, an intriguing question is whether the effect
of speaker variability on lexical processing will depend on the role of F0 in a language. These
issues will be investigated in Study 2 to systematically examine the effects of speaker variability,
depth of processing, time course, and tone language experience on listeners’ ability to access the
form and meaning of spoken words.
In sum, this literature review indicates that human listeners are capable of dealing with
speaker variability in speech perception, but the acoustic and perceptual basis of that ability
needs further exploration. Understanding how speaker variability is processed is one of the
foundational issues in speech perception research. Although speaker variability research has
traditionally focused on processing segmental structure in the English language, recent work in
the PI’s lab has provided preliminary evidence on how listeners process speaker variability in
tone languages. By thematically examining the acoustic basis of F0 height perception and the
effect of speaker variability on lexical processing, the proposed project will be a substantial and
meaningful extension of the speech perception literature on processing speaker variability.
C. Experimental Design and Methods
Two integrated studies are proposed to address the two objectives of this project. These two
studies are distinct in the specific questions addressed, but are thematically related to the
overarching issue of processing speaker variability in speech perception. Both studies will
employ well-established research paradigms in speech perception, speech acoustics, and spoken
word recognition. All of the research paradigms have been successfully applied to the PI’s
published studies and pilot work.
C.0. General procedures
Speech recordings for experimental stimuli will be made in a sound-treated booth with a
condenser microphone connected through a preamplifier and analog-to-digital converter to a
computer. The recordings will be digitized with the Brown Lab Interactive Speech System
(BLISS, Mertus, 2000) at 20 kHz with 14-bit quantization. Each stimulus item will be identified
from the waveform display and saved as an audio file. The peak amplitude will be normalized
across items. BLISS will be used for stimulus delivery. All participants will be adults between
18 and 35 years old, and will be screened for normal hearing, defined as pure-tone, air-conducted
thresholds of ≤ 20 dB HL at octave frequencies from 250 to 8000 Hz. Participants will be tested
individually in a sound-treated booth, listening to stimulus items through high-quality
headphones. BLISS will be used for response data acquisition.
C.1. Study 1: Perception of relative F0 height from multispeaker tones
Study 1 will examine the acoustic and perceptual basis of F0 height estimation without
typical cues for speaker normalization. Experiment 1.1 will examine Cantonese level tone
identification. Experiment 1.2 will examine speaker gender identification from the same set of
tone stimuli. Four hypotheses will be tested: (1) Accuracy of F0 height judgment will be
obtained to test the hypothesis that listeners are able to identify multispeaker level tones without
cues typically considered necessary for speaker normalization. (2) Acoustic analysis will be
conducted on the stimuli to test the hypothesis that covariations exist between F0 and voice
quality measures as the basis for the F0 estimation ability. (3) Accuracy of speaker gender
judgment will be obtained to test the hypothesis that gender detection contributes to tone
identification. (4) Identification performance will be compared among the listener groups to test
the hypothesis that F0 estimation ability varies as a function of tone language experience.
C.1.a. Materials. In Experiment 1.1, two triplets of monosyllabic Cantonese words
contrasting in level tones (high-mid-low) will be selected as stimuli, e.g., /fu˦/ (夫 “husband”),
/fu˧/ (富 “rich”), and /fu˨/ (負 “negative”); /ʃɪŋ˦/ (升 “rise”), /ʃɪŋ˧/ (勝 “win”), and
/ʃɪŋ˨/ (盛 “prosperous”). By consulting relevant sources of word frequency counts (e.g.,
Kwan, 2001), care will be taken to select tonal minimal pairs that are relatively balanced in word
frequency across the three tones to avoid lexical effects on tone identification. The selected
words will be recorded by 15 female and 15 male adult native speakers of Cantonese. Two
stimulus lists will be generated, with the 90 items from one triplet (30 speakers × 3 tones) in the
first list, and the 90 items from another triplet in the second list. A listener will be randomly
assigned to receive only one list to minimize familiarity with individual speakers. To further
minimize familiarity, the 90 stimuli produced by the 30 speakers will be assigned to three blocks
such that each block includes only one stimulus from a given speaker. Within each block, the
number of female and male speakers will be balanced (15 females and 15 males), as will be the
number of the three tones (10 high-tone stimuli, 10 mid-tone stimuli, and 10 low-tone stimuli).
The materials for Experiment 1.2 will be identical to those used for Experiment 1.1.
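The blocking constraints described above can be sketched in code. This is only an illustration under stated assumptions, not the stimulus-preparation procedure itself: the function name `assign_blocks`, the speaker labels, and the rotation rule are hypothetical. Within each gender, speakers are dealt into three groups of five, and each group's tone is rotated across blocks, so that every block contains one item per speaker, 15 female and 15 male voices, and 10 items per tone.

```python
import random

TONES = ["high", "mid", "low"]

def assign_blocks(females, males, seed=0):
    """Assign each (speaker, tone) item to one of three blocks.

    Hypothetical scheme: within each gender, speakers are dealt into three
    groups; group g contributes tone (g + b) % 3 to block b, so each speaker
    appears once per block and produces all three tones across blocks.
    """
    rng = random.Random(seed)
    females, males = females[:], males[:]
    rng.shuffle(females)
    rng.shuffle(males)
    blocks = [[], [], []]
    for gender, speakers in (("F", females), ("M", males)):
        for i, speaker in enumerate(speakers):
            group = i % 3  # three equal-sized groups per gender
            for b in range(3):
                tone = TONES[(group + b) % 3]
                blocks[b].append((speaker, gender, tone))
    for block in blocks:
        rng.shuffle(block)  # randomize presentation order within a block
    return blocks

# Placeholder speaker labels (assumed for illustration)
females = [f"F{i:02d}" for i in range(1, 16)]
males = [f"M{i:02d}" for i in range(1, 16)]
blocks = assign_blocks(females, males)
```

Other assignment schemes would satisfy the same constraints; the rotation is used here only because it makes the balance easy to verify.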
C.1.b. Participants. In addition to the 30 Cantonese speakers for stimulus recordings, there
will be four groups of 40 listeners, Cantonese (40), Taiwanese (40), Mandarin (40), and English
(40), for a total of 190 participants. Half of the listeners (20 per group) will
participate in Experiment 1.1 and the other half will participate in Experiment 1.2. The
Taiwanese and Mandarin participants will be limited to those without prior knowledge of
Cantonese, and the English participants will be limited to those without prior knowledge of
lexical tones. The four language groups differ in the type of lexical tone experience: Cantonese
has three contrastive level tones (high, mid, and low). Taiwanese has two contrastive level tones
(high and mid). Mandarin has one level tone (high) but no tonal contrasts based solely on F0
height. English does not have lexically contrastive tones. In other words, the four groups of
participants will vary systematically in the type of tone experience they have.
The Cantonese participants (30 speakers for stimulus recording and 40 listeners for the
identification experiment) will be recruited from the student population at Hong Kong Baptist
University (HKBU) with assistance from Dr. Lian Hee Wee of the Department of English
Language and Literature at HKBU. Dr. Wee is the Director of the Phonetics Laboratory at
HKBU and an expert in the phonetics and phonology of Chinese languages. Because Cantonese
is the primary language spoken in Hong Kong, no difficulties in participant recruitment are
expected. The Taiwanese participants will be recruited from the student population at National
Cheng Kung University (NCKU) in Tainan, Taiwan with assistance from Dr. Jenn-Yeu Chen of
the Department of Psychology/Institute of Cognitive Science at NCKU. Dr. Chen is the Director
of the Language, Culture, and Cognition Laboratory at NCKU and an established scholar in
psycholinguistics. The PI has a history of successful collaborations with Dr. Chen since 2009.
Because NCKU is located in a predominantly Taiwanese-speaking area, no difficulties in
participant recruitment are expected. The Mandarin and English participants will be recruited
from the student population at Ohio University in Athens, Ohio. There are currently over 800
native speakers of Mandarin at the Athens campus of 21,324 students. Based on the PI’s past
experience of conducting language research at the University, no difficulties in participant
recruitment are expected.
C.1.c. Procedure. In Experiment 1.1, the Cantonese tone identification task, Cantonese
participants will be asked to identify the word they hear by pressing response keys labeled with
Chinese characters on a response box. Chinese characters are the most common system for
representing spoken Cantonese, and all Cantonese participants are expected to be highly familiar
with the orthography. The motivation for using Chinese characters as response labels is that a
word recognition task is presumably more natural and ecologically relevant to spoken language
comprehension than a tone labeling task. However, the word recognition task will not be
appropriate for the Taiwanese, Mandarin, and English participants because they will have no
knowledge of Cantonese. Instead, these three groups of participants will receive a brief tutorial
on the high-mid-low distinction among the three Cantonese tones. They will then be asked to
judge the relative height of the tones intended by the speakers by pressing response keys labeled
with “HIGH”, “MID”, and “LOW”. In Experiment 1.2, the speaker gender identification task,
all participants will be asked to judge the gender of the speakers for each stimulus by pressing
keys labeled with “FEMALE” or “MALE”.
C.1.d. Data analysis. In Experiment 1.1, arcsine-transformed accuracy of Cantonese tone
identification will be compared to chance (33.3%, given three response alternatives) with one-sample t tests to evaluate the
hypothesis that listeners are able to identify tones above chance without cues typically
considered necessary for speaker normalization. Acoustic analysis will be conducted of F0,
duration, and three voice quality measures including open quotient (amplitude difference
between the first and second harmonic), F1 bandwidth (amplitude difference between the first
harmonic and the strongest harmonic in the F1 range), and spectral tilt (amplitude difference
between the first harmonic and the strongest harmonic in the F3 range) to test the hypothesis that
covariations exist between F0 and voice quality measures and that such covariations could be
useful for F0 height estimation. In Experiment 1.2, arcsine-transformed accuracy of speaker
gender identification will be compared to chance (50%) with one-sample t tests to evaluate the
hypothesis that gender detection underlies tone identification performance. In Experiments 1.1
and 1.2, analyses of variance (ANOVAs) will be conducted on arcsine-transformed accuracy and
log-transformed reaction time of tone identification (1.1) and speaker gender identification (1.2)
to evaluate the effects of tone language experience.
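As a rough illustration of the comparison to chance described above, the following sketch applies an arcsine-square-root transform to per-listener accuracy and computes a one-sample t statistic against the transformed chance level. The exact form of the transform is an assumption (the text specifies only "arcsine-transformed"), the function names are hypothetical, and the accuracy values are made up for illustration.

```python
import math
from statistics import mean, stdev

def arcsine_transform(p):
    """Arcsine-square-root transform of a proportion in [0, 1] (assumed form)."""
    return math.asin(math.sqrt(p))

def one_sample_t(proportions, chance):
    """Return (t, df) comparing transformed accuracy to transformed chance."""
    scores = [arcsine_transform(p) for p in proportions]
    mu0 = arcsine_transform(chance)
    n = len(scores)
    t = (mean(scores) - mu0) / (stdev(scores) / math.sqrt(n))
    return t, n - 1

# Made-up gender identification accuracies for five hypothetical listeners
accuracy = [0.70, 0.65, 0.80, 0.75, 0.60]
t_stat, df = one_sample_t(accuracy, chance=0.5)
```

The chance level is passed as a parameter so that the same function can serve tasks with different numbers of response alternatives.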
C.2. Study 2: Processing speaker variability in repetition and semantic/associative priming
Study 2 will examine the effect of speaker variability in accessing word form and meaning.
Four experiments are planned including two types of priming (repetition and
semantic/associative priming) and two languages (English and Mandarin). Repetition priming
will be used as an index of accessing word form and semantic/associative priming as an index of
accessing word meaning. In repetition priming, the processing of a word (target) is facilitated
when it is preceded by exactly the same word (prime). The magnitude of repetition priming is
usually reduced when the prime and target do not match exactly, indicating that the prime has
activated a different lexical form. If speaker information affects access to word form, the
magnitude of repetition priming should be reduced when the prime and target are produced by
different speakers. In semantic/associative priming, the processing of a target (e.g., queen) is
facilitated when it is preceded by a semantically/associatively related prime (e.g., king). If
speaker information affects the access to word meaning, the mismatch in speaker between prime
and target should result in attenuated semantic/associative priming.
First, the use of both repetition and semantic/associative priming in the proposed study will allow
examination of the depth of processing for the speaker variability effect. Second, given that
word recognition is a fast, time-sensitive process, the time lag between the prime and target will
be manipulated to examine the time course of the speaker variability effect. Third, the priming
experiments will be conducted with both English and Mandarin materials to evaluate how tone
language experience shapes the speaker variability effect. The following hypotheses will be
tested: (1) The magnitude of repetition and semantic/associative priming will be reduced when
the prime and target are produced by different speakers. (2) The reduction of repetition and
semantic/associative priming will vary as a function of the interstimulus interval between the
prime and target, revealing the time course of the speaker variability effect. (3) The reduction of
priming by speaker variability will be greater in a tone language because of the extensive use of
F0 in signifying lexical distinctions, resulting in elevated sensitivity to speaker characteristics in
tone language listeners.
C.2.a. Materials. English materials will be used in Experiment 3.1 (repetition priming) and
3.2 (semantic/associative priming), and Mandarin materials will be used in Experiment 3.3
(repetition priming) and 3.4 (semantic/associative priming). In the repetition priming
experiments, real word and nonword targets will be paired with four primes varying in word
relation (repetition or unrelated) and speaker relation (same or different), as illustrated in Table 1
below. For the English materials, four speakers (two female and two male) of American English
from the same town in central Ohio will record the stimuli to avoid a potential confound of dialect
difference. To minimize potential long-term priming effects from repetition of targets by the
same speaker, each target will be produced by a given speaker only once during the experiment.
On the other hand, to ensure all prime-target conditions are balanced across all speakers, there
will be 20 word targets and 20 nonword targets such that each speaker is assigned to a given
condition five times. In sum, each of the 40 targets will be paired with four primes for a total of
160 trials.
Table 1: An example of the repetition priming setup
Prime (speaker)   Target (speaker)   Word relation   Speaker relation
queen (1)         queen (1)          repetition      same
queen (4)         queen (2)          repetition      different
bell (3)          queen (3)          unrelated       same
bell (2)          queen (4)          unrelated       different
The design of the semantic/associative priming experiments is identical to the repetition
priming experiments except that the prime-target relationship is semantic/associative instead of
form-based. In particular, real word and nonword targets will be paired with four primes varying
in word relation (semantic/associative or unrelated) and speaker relation (same or different), as
illustrated in Table 2 below. The recordings will be made by the same speakers as in the
repetition priming experiment, and the same considerations apply regarding no repetition of
targets and balanced pairing between speakers and prime-target conditions. As in the repetition
priming experiment, each of the 40 targets will be paired with four primes for a total of 160 trials.
Table 2: An example of the semantic/associative priming setup
Prime (speaker)   Target (speaker)   Word relation   Speaker relation
king (2)          queen (2)          semantic        same
king (1)          queen (3)          semantic        different
bell (4)          queen (4)          unrelated       same
bell (3)          queen (1)          unrelated       different
The Mandarin materials will be adapted from the PI’s previous work on the effect of
processing lexical tone in form and mediated priming (Lee, 2007). As in the English experiments,
20 word targets and 20 nonword targets will be selected. In both repetition and
semantic/associative priming tasks, each target will be paired with four primes that are related or
unrelated to the target. There will be 160 trials in each experiment. Four speakers of Beijing
Mandarin will record the stimuli. The assignment of speakers to prime-target conditions will be
identical to the English experiments.
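The speaker-to-condition balancing described for the priming materials can be sketched as a Latin-square rotation. This is an illustration under assumptions, not the actual stimulus lists: the target labels are placeholders, the function name `build_trials` is hypothetical, and the rule for choosing the different-speaker prime (the speaker two positions away) is assumed, although it agrees with the example pairs in Tables 1 and 2.

```python
from collections import Counter
from itertools import product

SPEAKERS = [1, 2, 3, 4]
# Four prime conditions: word relation crossed with speaker relation
CONDITIONS = list(product(["related", "unrelated"], ["same", "different"]))

def build_trials(targets):
    """Pair every target with four primes, rotating target speakers."""
    trials = []
    for t, target in enumerate(targets):
        for c, (word_rel, speaker_rel) in enumerate(CONDITIONS):
            # Rotation: each target is produced once by each speaker
            target_spk = SPEAKERS[(t + c) % 4]
            if speaker_rel == "same":
                prime_spk = target_spk
            else:
                # Assumed rule: take the speaker two positions away
                prime_spk = SPEAKERS[(t + c + 2) % 4]
            trials.append({
                "target": target,
                "word_relation": word_rel,
                "speaker_relation": speaker_rel,
                "prime_speaker": prime_spk,
                "target_speaker": target_spk,
            })
    return trials

words = [f"word{i:02d}" for i in range(1, 21)]        # placeholder labels
nonwords = [f"nonword{i:02d}" for i in range(1, 21)]  # placeholder labels
trials = build_trials(words) + build_trials(nonwords)

# Count how often each speaker produces a word target in each condition
counts = Counter(
    (tr["word_relation"], tr["speaker_relation"], tr["target_speaker"])
    for tr in trials if tr["target"].startswith("word")
)
```

With 20 word targets and four speakers, every entry in `counts` equals five, which corresponds to each speaker being assigned to a given condition five times.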
Two interstimulus intervals (50 ms and 250 ms), as were used in the PI’s pilot work (Lee &
Zhang, submitted), will be used to explore the time course of the priming effects. To avoid
potential long-term priming effects due to repetition of targets, interstimulus interval will be a
between-subject factor such that participants are not exposed to the same set of stimuli
excessively.
C.2.b. Participants. There will be 40 participants in each of the four experiments (20 for
each interstimulus interval) for a total of 160 participants. Half will be English speakers and the
other half will be Mandarin speakers.
C.2.c. Procedure. In all four experiments, participants will be randomly assigned in equal
numbers to one of the two interstimulus interval conditions. Each participant will receive a
uniquely randomized order of stimulus presentation. Participants will be instructed to judge
whether a target is a real word in the English/Mandarin language or not by pressing keys labeled
with “WORD” and “NONWORD”.
C.2.d. Data analysis. In all experiments, ANOVAs will be conducted on arcsine-transformed
response accuracy and log-transformed reaction time with word relation
(repetition/semantic, unrelated) and speaker relation (same, different) as within-subject variables,
interstimulus interval (50 ms, 250 ms) as a between-subject variable, and participants and items
as random variables. These analyses will evaluate the effect of speaker variability on accessing
word form and meaning, and the time course of the speaker effect. The effect of tone language
experience will be evaluated by statistically comparing the response patterns between
Experiments 3.1/3.2 and Experiments 3.3/3.4.
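The quantity of interest in these analyses, the priming magnitude and its reduction in different-speaker trials, can be illustrated with a short sketch. It does not implement the ANOVAs; it only computes cell means on log-transformed reaction times. The use of the natural logarithm is an assumption (the text specifies only "log-transformed"), the function name is hypothetical, and the reaction times are made up for illustration.

```python
import math
from statistics import mean

def priming_effects(trials):
    """Priming magnitude (unrelated minus related mean log RT) per speaker relation."""
    cells = {}
    for word_rel, speaker_rel, rt_ms in trials:
        cells.setdefault((word_rel, speaker_rel), []).append(math.log(rt_ms))
    effects = {}
    for speaker_rel in ("same", "different"):
        effects[speaker_rel] = (
            mean(cells[("unrelated", speaker_rel)])
            - mean(cells[("related", speaker_rel)])
        )
    return effects

# Made-up reaction times in ms (illustration only)
trials = [
    ("related", "same", 600), ("related", "same", 620),
    ("unrelated", "same", 700), ("unrelated", "same", 720),
    ("related", "different", 650), ("related", "different", 660),
    ("unrelated", "different", 700), ("unrelated", "different", 710),
]
effects = priming_effects(trials)
# Positive value: priming is smaller when prime and target speakers differ
speaker_cost = effects["same"] - effects["different"]
```

In the ANOVA framework, a reduction of this kind would appear as an interaction between word relation and speaker relation.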
C.3. Timetable, arranged by experiment and number of participants to be tested:
[Timetable. Columns: fall (F), spring (Sp), and summer (Su) of Years 1-3. Rows, for each of Study 1 (Experiments 1.1-2) and Study 2 (Experiments 3.1-4): stimulus preparation, data collection, data analysis, and results dissemination. Stimulus preparation for Experiments 1.1-2 involves 30 speakers; data collection proceeds in batches of 40 participants.]
Notes. F: fall semester; Sp: spring semester; Su: summer.
D. Broader Impacts
D.1. Advance discovery and understanding while promoting teaching, training, and
learning
This proposed project will provide extensive research training to a postdoctoral researcher
and several graduate and undergraduate students aspiring to pursue a research and teaching
career. The trainees will be involved in all phases of the project to obtain knowledge and skills
for conducting research. The students will also be able to integrate their coursework in speech
acoustics, speech perception, and psycholinguistics into many aspects of the project. The
undergraduate students will particularly benefit from the research experience. As the instructor
of two undergraduate core courses with an average registration of 60 students per class, the PI
constantly receives requests from undergraduate students wishing to participate in lab activities.
In the last two years, the PI has worked with six undergraduate students, who participated in
stimulus preparation, data collection, and regular lab meetings. Two students received the
competitive Provost’s Undergraduate Research Fund at Ohio University to conduct research, and
all three graduating seniors successfully entered graduate programs in speech-language
pathology or audiology. Support from the grant will enable the PI to identify and engage select
undergraduate students in integrating basic science research into their education.
D.2. Broaden the participation of underrepresented groups
Due to the crosslinguistic nature of this project, the postdoctoral researcher and graduate
students to be recruited for the project are expected to be fluent in Cantonese, Mandarin, and/or
Taiwanese. This project will also allow a substantial number of speakers of the three languages
to participate in speech and language research. The trainees and participants will make a
significant contribution to this project on speech perception and spoken word recognition, where
these non-English languages are relatively underrepresented.
D.3. Enhance the infrastructure for research and education
The proposed project involves topics traditionally studied in distinct disciplines including
linguistics, cognitive psychology, and speech and hearing sciences. Support from the grant will
significantly enhance the PI’s effort in maintaining existing collaborations and identifying new
collaborations to achieve the PI’s long-term goal of understanding the processing of linguistic
and nonlinguistic prosody by listeners with various characteristics. To that end, the PI has
established collaborations with linguists, language teachers, psycholinguists, clinical
psychologists, and cognitive ethnomusicologists at Ohio University and beyond. For example,
the PI has successfully collaborated with colleagues in Linguistics and the Chinese language
program at Ohio University (L. Tao and Z. S. Bond) on a series of studies on the effects of
acoustic variability on native and nonnative tone perception. Since 2008, this collaboration has
resulted in four published articles and two manuscripts under review, all in high-impact journals.
The rapid growth in the number of Mandarin-speaking students and English-speaking students
enrolling in Chinese language courses at Ohio University has allowed us to conduct these crosslinguistic
studies on lexical tone perception. Support from the grant will significantly enhance the PI’s
ability to extend this knowledge network to participating international institutions.
D.4. Broad dissemination to enhance scientific and technological understanding
Findings from this project will be disseminated to relevant scientific communities (acoustics,
linguistics, cognitive psychology, and language teaching) through journal publications,
conference presentations, and presentations at the PI’s institution and other institutions. All of
the PI’s previous studies have been successfully published in high-impact journals such as the
Journal of the Acoustical Society of America, Journal of Phonetics, Language and Speech, and
Speech Communication. Disseminating findings through these outlets is expected to reach a wide
audience including scientists, language teachers, and graduate and undergraduate students.
D.5. Benefits to society
Due to the basic science nature of the proposed project, direct benefits to society will be
limited. However, knowledge of how speaker variability is processed in tone perception may be
used as a knowledge base to develop effective means for teaching lexical tones. For English
speakers learning tone languages, lexical tones are one of the most difficult aspects to master.
Identifying how tone language experience affects the processing of speaker variability in tone
perception and word recognition can potentially contribute to improving and strengthening
instruction of tone languages. Current approaches to tone language instruction emphasize using
idealized speech materials that are clearly articulated and produced by a minimal number of
speakers. However, research in speech perception has shown that human listeners are quite adept
at compensating for acoustic variability and that such variability could in fact contribute
to robust acquisition of foreign sound contrasts. Therefore, it will be helpful to incorporate into
instruction speech materials that are acoustically more challenging (e.g., produced by multiple
speakers) to facilitate the transition from idealized materials to connected, conversational speech.