Perceptual Experiment

The role of the plateau in spoken word recognition
Chapter 4
The role of the plateau in spoken word
recognition
4.1 Introduction
Results from the previous production experiments, discussed in Chapters 2 and 3, suggest
that the end of the plateau (EP) may mark linguistic structure. The alignment of this point
within the accented syllable covaries with foot structure in the same way for all speakers
but is unaffected by the non-structural factors pitch span and utterance type.
It is
conceivable that speakers use EP to signal aspects of linguistic structure such as whether
or not there are more syllables to come in the word or foot.
However, although it seems EP covaries with linguistic structure in production it is not
known whether its alignment facilitates the listener’s task in any way. Thus, the aim of
the present experiment is to see whether listeners can make use of the alignment of EP in
speech processing. As EP alignment covaries with the number of syllables in the foot,
one conceivable hypothesis is that the listener may attend to the alignment of EP within
the syllable and use this information in the process of spoken word recognition.
In order to ascertain whether EP does indeed mark linguistic structure and facilitate
processing we need to be able to experimentally manipulate the alignment of this point
within the syllable whilst holding other factors, such as the length of the preceding
plateau and the rate of change of the fall, constant.
The technical details of such
manipulations are made easy by resynthesis but it is notoriously difficult to design
experiments that evaluate the function of intonational contrasts. There are many issues to
consider in the design of both the stimuli and the task.
4.1.1 Issues in designing stimuli
On the face of things it may seem very easy to design stimuli for experiments concerning
intonation. For example, one could compare responses to stimuli sensitive to the feature
under test and stimuli insensitive to this feature. So for example we could compare one
set of stimuli with EP alignment modelled on natural speech, and thus sensitive to
prosodic structure, to another set with a random alignment of EP. This approach is
doubly problematic, however. Firstly, it is difficult to interpret the results from such an
experiment. Any significant results could be attributed only to the fact that listeners are
distinguishing between natural and unnatural patterns in the stimuli rather than having
their perception directly affected by the experimental factor in question. So, for example,
the set of stimuli with EP sensitive to prosodic structure may be processed more
efficiently merely because it is more natural and not because the alignment of EP
facilitates processing. Secondly, even when such results can be considered to be reliable,
they reveal very little about how the perceptual system processes natural speech, as the
random stimuli have been created using unnatural contrasts.
Another general difficulty encountered when designing stimuli for perceptual
experiments concerning intonation is that there is, of course, no one-to-one match
between intonation contours and meaning. Thus if intonation patterns are changed this
may create a different, and yet still perfectly acceptable, natural intonation contour. In
addition, intonation signals many different functions in English. For example it gives
information about grammatical structure, sentence type and focus as well as the attitudes
and emotions of the speaker. Therefore it is crucially important to resynthesise stimuli in
such a way that they will not sound unnatural or change the meaning of the utterance.
As the point of interest here is the alignment differences between EP in monosyllabic and
polysyllabic feet it will be interesting to see what happens to processing when this
alignment is changed. In order not to create stimuli that may lead to uninterpretable
results one solution is to investigate how processing is affected when two naturally
attested patterns are exchanged.
Therefore, in this experiment processing will be
examined when the syllable in a monosyllabic foot is synthesised with EP aligned
appropriately for the accented syllable in a polysyllabic foot, and vice versa. These
stimuli are similar in type to those used by Ogden et al. (2000) (as described in Chapter 1),
where monosyllabic words were synthesised with correct or incorrect alignment. In this
experiment, however, stimuli will contain both mono- and polysyllabic words.
One important issue to consider is whether the duration of the syllable should be a factor
in the present experiment. As we have seen, alignment differences are in part due to the
different durations of the accented syllable in mono- and polysyllabic feet. For this
experiment, a decision was made not to include duration as a factor, as a significant result
found for stimuli containing altered alignment alone would show that alignment itself is
important regardless of the duration of the accented syllable.
4.1.2 Issues in designing the task
It is common when testing intonation for speech synthesis systems to use subjective
measures. For example, in Mean Opinion Score (MOS) tests, listeners are asked to rate
(usually on a five-point scale) whether the intonation is acceptable or natural. This is a
perfectly reasonable method for speech synthesis systems where the main measure of
quality is indeed the opinions of clients. If we want to evaluate the function of intonation,
however, it is important to use objective tasks that will tap directly into the particular
cognitive processes that we believe a contrast may affect (cf. Duffy and Pisoni 1992).
As it is suggested that EP is a marker of low-level linguistic structure, indicating
differences in word or foot length, it is important to employ a task that will tap into
processing at that particular level. Many methods are currently used for examining the
processes of spoken word recognition such as gating, priming and word spotting. A
simplified version of a sentence-picture verification task was chosen for use in this
experiment.
Clark and Chase (1972) report a series of experiments where subjects
compare a written sentence with a picture. Pictures in these experiments were always
simple representations containing two geometric symbols, one above the other. Subjects
made truth judgments to affirmative (‘star is above plus’) and negative (‘star isn’t above
plus’) statements and their reaction times and errors were recorded. Results from these
experiments led to a theory of sentence-picture verification that states that both elements
(sentence and picture) must be represented in the same form and then compared to each
other in order for the task to be completed. Each step of the process takes a fixed time
and therefore extra operations add more time to response latencies. As in all reaction
time experiments, more difficult tasks are indicated by longer latencies and a higher error
rate.
The sentence-picture verification task is suitable for a number of reasons. Firstly, it
allows for the measurement of both reaction time and errors, both of which are measures
of processing efficiency. In addition, it allows for the presentation of a whole sentence
rather than a single word, which is important as EP alignment must be considered as part
of a larger prosodic hierarchy. Also, as much is known about the results that can be
expected from this task with non-manipulated stimuli, it will be easy to tell whether the
test is sensitive or not when used with manipulated stimuli. Finally, the task is simple for
subjects to complete and generally leads to a relatively low error rate, so that any errors
that are made are interpretable.
4.1.3 Hypothesis
The hypothesis is that an appropriate alignment of EP will lead to faster reaction times
and fewer errors than an inappropriate alignment because processing will be easier. This
hypothesis entails several assumptions about the processes involved in recognition. The
first is that subjects will process the intonation contour at an adequate level of detail and
will be able to use the information. The second is that information about alignment forms
a part of the mental representation of the word and can thus be matched or checked
against alignment information in the signal. It is unlikely that EP alignment actually
forms part of a dictionary entry for a word, as its alignment is predictable from structure,
so the mental representation may take the form of a dictionary entry plus rules for the
alignment of EP depending on the structure of the word.
These issues will be discussed in further detail in section 4.4.4.2, which deals with
previous experimental studies of the use of intonation in word recognition, and section
4.4.4.3, which considers the incorporation of prosody into current models of word
recognition.
4.2 Method
4.2.1 Material
4.2.1.1 Auditory material
4.2.1.1.1 Items
Ten words (with varied onset and coda types) were chosen which are entire monosyllabic
words in their own right as well as forming the first stressed syllable of a polysyllabic
word. For example the monosyllabic word ‘cat’ also forms the first stressed syllable of
the polysyllabic word ‘catamaran’. All the words chosen are imageable nouns, as the
sentence-picture verification task requires that they be pictorially represented. The items
are shown in Table 4.1.
Monosyllabic   Polysyllabic
Cat            Catamaran
Guard          Garden
Guide          Guidedog
Key            Keyhole
Pan            Panther
Pea            Peacock
Toad           Toadstool
Train          Trainer
Jug            Juggler
Nail           Nailfile
Table 4.1 Items used in the experiment
4.2.1.1.1.1 Recording and measurement of natural utterances
A female native speaker of SBE recorded each of the target words in the carrier sentence
‘It’s a picture of a ____’ with falling intonation on the target word. The recordings were
made in a sound-treated room using Cool Edit and a high-quality microphone.
Measurements were taken from these utterances to ensure that they followed the pattern
found previously, specifically that the alignment of EP within the accented syllable was
later in polysyllabic than monosyllabic words.
EP was identified using methods
described for previous production experiments. The duration of the accented syllable was
measured and alignment calculated as a percentage of syllable duration from the
beginning of the syllable, again using the same method as in previous production
experiments. The rate of change (in Hertz per second) was determined by identifying the
low tone, measuring the difference in Hertz between EP and L, and dividing this distance
by the difference in seconds between the two points. A schematic representation of these
measurements can be seen in Figure 4.1, and the results of these measurements are given
in Table 4.2. In general the patterns found mirror those from the production experiments
in Chapters 2 and 3; EP is aligned later in a shorter syllable in polysyllabic words. There
is also a trend for the rate of change in the fall to be greater in the monosyllabic word.
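As a rough illustration of the two measures just described, the following sketch computes EP alignment as a percentage of syllable duration and the rate of change of the fall. The function names and the sample times and frequencies are hypothetical; they are not measurements from this study.

```python
def ep_alignment_percent(syllable_start_s, syllable_end_s, ep_time_s):
    """EP position as a percentage of syllable duration, from syllable onset."""
    duration = syllable_end_s - syllable_start_s
    if duration <= 0:
        raise ValueError("syllable must have positive duration")
    return 100.0 * (ep_time_s - syllable_start_s) / duration


def rate_of_change_hz_per_s(ep_time_s, ep_hz, low_time_s, low_hz):
    """Slope of the fall: drop in Hz from EP to L divided by elapsed seconds."""
    elapsed = low_time_s - ep_time_s
    if elapsed <= 0:
        raise ValueError("L must follow EP")
    return (ep_hz - low_hz) / elapsed


# Hypothetical values for one utterance (illustration only).
alignment = ep_alignment_percent(0.0, 0.5, 0.2)
slope = rate_of_change_hz_per_s(0.2, 260.0, 0.6, 180.0)
print(f"EP alignment: {alignment:.0f}% of syllable")
print(f"Rate of change: {slope:.0f} Hz/s")
```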
Monosyllabic
Item        Syllable duration (ms)   EP alignment (% of syllable)   Rate of change (Hz/s)
Cat         574                      34                             367
Guard       484                      30                             185
Guide       524                      20                             203
Key         472                      45                             384
Pan         551                      31                             188
Pea         448                      40                             516
Toad        556                      29                             328
Train       610                      32                             260
Jug         566                      22                             369
Nail        403                      42                             267

Polysyllabic
Item        Syllable duration (ms)   EP alignment (% of syllable)   Rate of change (Hz/s)
Catamaran   287                      46                             181
Garden      373                      34                             112
Guidedog    323                      49                             124
Keyhole     289                      63                             491
Panther     370                      46                             447
Peacock     235                      83                             279
Toadstool   295                      45                             120
Trainer     356                      53                             281
Juggler     247                      36                             317
Nailfile    314                      77                             198
Table 4.2 Measurements taken from natural utterances

Figure 4.1 Schematic representation of measurements taken (EP, L and syllable duration, plotted as Hz against ms)
4.2.1.1.1.2 Resynthesis
All resynthesis was carried out in PRAAT. A single carrier sentence was resynthesised
by specifying turning points at seven points in the intonation contour. Within each target
word the pitch contour was specified at four points and was then spliced onto the
resynthesised version of the carrier sentence. As discussed above in section 4.1.1, the
design of the experiment involves swapping the alignment of EP between mono- and
polysyllabic words. To this end two versions of each target word were created. The
alignment of EP in each case was determined by whether the utterance was to have
correct or incorrect alignment. So, in a correct version of ‘cat’ EP is aligned naturally for
‘cat’ whilst in an incorrect version EP is aligned at the percentage of the syllable found in
‘catamaran’. The situation is reversed for correct and incorrect versions of ‘catamaran’.
The frequency of EP was always the same as in the natural utterance, regardless of
whether a correct or incorrect version was created. In order to create a plateau another
pitch point, representing SP, was added 70 ms (the average plateau duration for this
speaker) before EP at the same frequency. A further pitch point was added 80 ms before
SP to create the rise to the plateau. The alignment and frequency of the turning point
approximating the low tone were specified so as to maintain the rate of change found for
the natural utterance. So, for example, ‘cat’ always had the natural rate of change for
‘cat’ regardless of whether the EP was located for a correct or an incorrect utterance.
These factors are shown below in Table 4.3. Crucially, incorrect versions of each item
differ only in the alignment of the plateau (based on EP) and not in terms of any other
potentially distinguishing factor.
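The placement of the four pitch points within a target word can be sketched as follows. This is only one reading of the procedure: it assumes that the low tone keeps its frequency and is moved in time so that the fall retains its natural rate of change. The duration, alignment and rate values for ‘cat’ come from Table 4.2, whereas the EP and low tone frequencies are invented for illustration, as is the function name.

```python
PLATEAU_MS = 70   # SP precedes EP by the speaker's average plateau duration
RISE_MS = 80      # the rise to the plateau starts this long before SP


def pitch_point_times(syllable_ms, ep_percent, ep_hz, low_hz, rate_hz_per_s):
    """Return times (ms from syllable onset) of the four pitch points."""
    ep = syllable_ms * ep_percent / 100.0
    sp = ep - PLATEAU_MS
    rise_start = sp - RISE_MS
    # Place L so that (ep_hz - low_hz) / (L - EP) equals the natural rate.
    fall_ms = 1000.0 * (ep_hz - low_hz) / rate_hz_per_s
    return {"rise_start": rise_start, "SP": sp, "EP": ep, "L": ep + fall_ms}


# 'cat': 574 ms, natural alignment 34%, swapped alignment 46% (from 'catamaran'),
# natural rate 367 Hz/s. The Hz values below are assumptions, not reported data.
correct = pitch_point_times(574, 34, ep_hz=250.0, low_hz=180.0, rate_hz_per_s=367)
incorrect = pitch_point_times(574, 46, ep_hz=250.0, low_hz=180.0, rate_hz_per_s=367)

for name, points in (("correct", correct), ("incorrect", incorrect)):
    print(name, {label: round(t) for label, t in points.items()})
```

Under these assumptions the two versions differ only in where the whole rise-plateau-fall shape sits within the syllable; the plateau length and the slope of the fall are identical.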
            Duration of   EP          EP          Rate of   Low Tone
            syllable      Alignment   Frequency   Change    Frequency
Correct     Cat           Cat         Cat         Cat       Cat
Incorrect   Cat           Catamaran   Cat         Cat       Cat
Table 4.3 Factors for correct and incorrect versions of ‘cat’
4.2.1.1.2 Fillers
Ten pairs of fillers were chosen; these are shown in Table 4.4. Every filler was an
imageable noun. The members of a pair were either phonologically or semantically
related. Although not modified in any way, fillers were all resynthesised so they would
sound consistent with the test items and were then spliced onto the same carrier phrase
used for test items.
Burger       Burglar
Gibbon       Ribbon
Television   Telephone
Table        Chair
Pen          Pin
Ring         Herring
Apple        Pineapple
Saw          Jigsaw
Bow          Elbow
Tank         Fishtank
Table 4.4 Fillers used in the experiment
4.2.1.2 Picture material
A selection of pictures was chosen that were considered to be good visual examples of
each target word. In an informal rating experiment ten native speakers (different from
those in the main experiment) rated these pictures on the degree to which each was a
typical exemplar of the target word. The picture rated highest for each word was then
given to ten different subjects (again different from those in the main experiment) who
were asked to provide a label. Pictures were chosen for use in the final experiment only if
they were labelled correctly by seven of these subjects.
4.2.2 Subjects
Forty subjects were tested. All were monolingual speakers of British English residing in
Cambridge at the time of the experiment. None reported hearing or speech problems and
none were epileptic. Their ages ranged from 18 to 50 (mean 30). Thirty-eight were
right-handed and two were left-handed. Subjects were paid a small fee for their time.
4.2.3 Training
4.2.3.1 Auditory training
Previous production experiments have shown that although all speakers align EP later in
the accented syllable in polysyllabic than monosyllabic feet, the exact percentages are not
the same for each speaker.
For this reason, subjects were given auditory training
designed to expose them to this particular speaker’s voice and acclimatise them to the
particular patterns used to distinguish mono- and polysyllabic words without exposing
them to the target items themselves. Smith (2003) indicates that listeners need only a
small amount of training to become aware of linguistically-relevant talker characteristics
and thus the length of the auditory training was approximately three minutes. Subjects
listened to a fairy story (reproduced in Appendix C), read by the speaker of the test
sentences, over headphones. The story contained ten pairs of words similar to those used
in the main experiment, in that one member was a monosyllabic word and the other had
that word embedded as the first strong syllable of a polysyllabic word. Members of each
pair were far apart in the story to minimise the likelihood of subjects being alerted to the
purpose of the experiment.
4.2.3.2 Picture training
It was important to ensure that any differences in subjects’ responses were only due to the
auditory component and were not affected by factors associated with the pictures. This
had been partially controlled for by taking care to select the best picture to represent each
target word as described in section 4.2.1.2. However, it was still possible that some
pictures were better exemplars than others and also that different subjects might have
different prototypes for the same concept, which could cause unwanted differences in
subject responses. For this reason, subjects were trained to associate each picture with the
correct target word or filler. In this training session they were shown a picture on a
computer screen followed by a one-word label for that picture. They were shown each
target picture and label three times (60 in all) and also saw the same number of filler
items. Target and filler items were randomised together, with a different randomisation
being created for each presentation for each subject.
Next, subjects were shown the same pictures (again in a different random order) without
the label and asked to tell the researcher the label they remembered. If any mistakes were
made they were shown each picture and label again before being retested on their recall.
This procedure was repeated until they could remember all the labels. Most subjects
could remember all the labels after the initial three presentations.
4.2.4 Setup and instructions
Subjects were tested individually in a sound-treated room. Training and testing took
place in a single session and all parts of the test were run using DMDX
(http://www.u.arizona.edu/~kforster/dmdx/dmdx.htm) on a PC. Picture stimuli were
shown centralised on the screen and utterances were heard over headphones. The light in
the room was dimmed so it did not reflect onto the screen. Subjects were asked to sit
close to the desk so that they were comfortable.
Each version of each target word was heard once paired with the target picture (to elicit a
true response) and once paired with the picture representing the opposite member of the
pair (to elicit a false response). Items and fillers were pseudo-randomised together with
the condition that no picture or utterance occurred in two successive presentations. Table
4.5 shows the different presentations of each target word.
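The presentation scheme can be sketched as follows: each target word appears in two synthesis versions, each paired with two pictures, and the order is reshuffled until no word or picture occurs on two successive trials. This is a minimal sketch, not the DMDX script used in the experiment. It treats ‘utterance’ as the target word regardless of version (a stricter reading than the text requires), and it omits the fillers because their number of presentations in the main test is not stated. All function names are hypothetical.

```python
import random

PAIRS = [("cat", "catamaran"), ("guard", "garden"), ("guide", "guidedog"),
         ("key", "keyhole"), ("pan", "panther"), ("pea", "peacock"),
         ("toad", "toadstool"), ("train", "trainer"), ("jug", "juggler"),
         ("nail", "nailfile")]


def target_trials(pairs):
    """Four presentations per target word: 2 synthesis versions x 2 pictures."""
    trials = []
    for mono, poly in pairs:
        for word, other in ((mono, poly), (poly, mono)):
            for version in ("correct", "incorrect"):
                for picture in (word, other):
                    trials.append({"word": word, "version": version,
                                   "picture": picture, "truth": picture == word})
    return trials


def no_adjacent_repeats(order):
    """True if neither the spoken word nor the picture repeats on successive trials."""
    return all(a["word"] != b["word"] and a["picture"] != b["picture"]
               for a, b in zip(order, order[1:]))


def pseudo_randomise(trials, rng, max_tries=100_000):
    """Reshuffle until the adjacency constraint holds (simple rejection sampling)."""
    order = list(trials)
    for _ in range(max_tries):
        rng.shuffle(order)
        if no_adjacent_repeats(order):
            return order
    raise RuntimeError("no valid order found")


order = pseudo_randomise(target_trials(PAIRS), random.Random(0))
print(len(order))  # 80 target trials per subject
```

Eighty target trials per subject over forty subjects gives the 3,200 responses analysed in section 4.3.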
Target Word   Synthesis Version   Picture     Truth-value
Cat           Correct             Cat         True
Cat           Correct             Catamaran   False
Cat           Incorrect           Cat         True
Cat           Incorrect           Catamaran   False
Table 4.5 The four presentations of each target item
The left and right shift keys of the keyboard were labelled T (true) and F (false). The
same labels were placed on either side of the screen at eye-level. Subjects were instructed
to respond ‘true’ if the target word matched the picture and to respond ‘false’ if it did not.
Half of the right-handed subjects (19 in total) and half of the left-handed subjects (1
subject) responded ‘true’ with their right hand. The other subjects responded ‘true’ with
their left hand.
Subjects were instructed to keep their index fingers resting on the
appropriate keys at all times and to respond as quickly and as accurately as possible.
There was a break after every 30 responses and subjects could start the experiment again
in their own time.
Subjects were given a short practice test before the main experiment in order to
familiarise them with the format of the experiment. In this practice they saw each of the
filler items twice, once paired with the appropriate picture and once paired with the
picture for the other member of the pair.
On average subjects took about 20 minutes to complete the experiment and 30 minutes to
complete the training.
4.3 Results
4.3.1 Statistical procedure
Reaction times for correct responses were measured from the beginning of the target
syllable and errors were counted. Responses that took longer than three seconds were
considered to be non-responses. A log transform was applied to reaction times in order to
normalise the data enabling the use of parametric statistical procedures. For both reaction
times and errors, means were calculated for each combination of the variables foot type
(monosyllabic or polysyllabic), alignment (correct or incorrect) and truth-value (true or
false) making eight dependent variables for input into repeated measures (MANOVA)
analysis. Each dependent variable was created from all ten relevant responses from each
subject unless data points were missing (due either to the subject not responding,
responding too slowly or responding incorrectly) in which case means were calculated
only on the basis of the remaining data points.
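The preparation of the reaction time data can be sketched as follows. The sketch assumes a natural logarithm (the base is not specified above) and a hypothetical record layout; the demonstration responses are invented.

```python
import math
from collections import defaultdict

TIMEOUT_MS = 3000  # slower responses are treated as non-responses


def cell_means(responses):
    """Mean log RT per (subject, foot, alignment, truth) cell.

    `responses` is an iterable of dicts with keys subject, foot, alignment,
    truth, rt_ms (None if there was no response) and correct (bool).
    """
    cells = defaultdict(list)
    for r in responses:
        rt = r["rt_ms"]
        if rt is None or rt > TIMEOUT_MS or not r["correct"]:
            continue  # missing data point: excluded from the cell mean
        key = (r["subject"], r["foot"], r["alignment"], r["truth"])
        cells[key].append(math.log(rt))
    return {key: sum(values) / len(values) for key, values in cells.items()}


# Invented responses for one subject in one cell (illustration only).
base = {"subject": 1, "foot": "mono", "alignment": "correct", "truth": True}
demo = [
    {**base, "rt_ms": 1600, "correct": True},
    {**base, "rt_ms": 1800, "correct": True},
    {**base, "rt_ms": 3400, "correct": True},   # too slow: non-response
    {**base, "rt_ms": 1500, "correct": False},  # error: excluded from RT mean
    {**base, "rt_ms": None, "correct": False},  # no response
]
means = cell_means(demo)
print(means)
```

Each subject contributes eight such cell means (2 foot types x 2 alignments x 2 truth-values), which form the input to the repeated measures analysis.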
4.3.2 By subject
4.3.2.1 Reaction times
Response times to monosyllabic items are significantly faster than those to polysyllabic
items (F(1,39) = 2.833, p=0.05) as shown in the upper panel of Figure 4.4.
Response times to true utterances are significantly faster than those to false utterances
(F(1,39) = 58.00, p<0.01) as shown in the upper panel of Figure 4.2. There
is no significant effect of alignment (F(1,39) = 0.3635, p>0.05) as shown in the upper
panel of Figure 4.3. There is a significant interaction between alignment
and truth (F(1,39) = 3.22, p<0.05). This was further analysed by means of t-tests, which
indicated that responses to true items were quicker with incorrect than correct alignment
(t(39) = 1.715, p<0.05).
The difference in reaction times between mono- and polysyllabic items implies that these
should be analysed separately. To this end two further repeated measures analyses were
conducted for mono- and polysyllabic items separately. Each analysis had factors of truth
and alignment. Results for alignment are shown in Figure 4.4.
Reaction times to monosyllabic items are quicker when the proposition is true than when
it is false (F(1,39) = 46.950, p<0.05). Reaction times are not significantly affected by
alignment although there is a trend for reactions to be quicker with correct alignment
(F(1,39) = 0.285, p>0.05).
Polysyllabic items are also responded to faster when the proposition is true than when it is
false (F(1,39) = 42.255, p<0.01). Interestingly there is also a weakly significant trend for
reaction times to be quicker with incorrect alignment (F(1,39) = 2.516, p=0.06) and it is
this result that leads to the overall finding of quicker reaction times to true items with
incorrect alignment.
4.3.2.2 Errors
Overall, the error rate is very low. Out of 3,200 responses there were only 60 errors, a
rate of 1.9%. This is a common finding for sentence-picture verification tasks: for
example, Clark and Chase (1972: 492) find error rates of between 4.7% and 12.8% for
positive sentences when the picture is presented first.
There were more errors to monosyllabic than polysyllabic items (F(1,39) = 5.151,
p<0.05) as shown in the lower panel of Figure 4.4. There were more errors to false than
true propositions (F(1,39) = 3.136, p<0.05) as shown in the lower panel of Figure 4.2.
There was a weakly significant effect of alignment with more errors made to incorrect
versions than correct versions (F(1,39)=2.145, p=0.08) as shown in the lower panel of
Figure 4.3. In the same way as for reaction times, mono- and polysyllabic items can be
analysed separately. For monosyllabic items, errors are not affected either by the truth-value
of the proposition (F(1,39)=0.188, p>0.05) or by the alignment of EP (F(1,39) =
0.526, p>0.05). Interestingly, more errors are made to polysyllabic items when the
proposition is false (F(1,39) = 2.951, p<0.05) and when incorrect alignment is used
(F(1,39) = 3.690, p<0.05).
4.3.3 By item
A further analysis was conducted to examine the differences in reaction time and errors
between correct and incorrect versions of each item. This analysis used paired t-tests
with a single factor of alignment over items rather than over subjects.
Overall, reaction time is not significantly affected by the alignment (t(19) = 0.089,
p>0.05). There is also no significant effect when monosyllabic (t(9) = 0.426, p>0.05) and
polysyllabic items (t(9)=0.532, p>0.05) are analysed separately. Errors show a weakly
significant effect overall: more errors are made when the incorrect alignment is used
(t(19) = 1.629, p=0.06). Monosyllabic items show no significant effect of alignment on
errors (t(9) = 0.410, p>0.05) but there are more errors to polysyllabic items synthesised
with incorrect alignment (t(9)=2.400, p<0.05).
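The by-item error analysis can be checked with a short paired t-test over the per-item error counts given in Table 4.6. The sketch below uses only the standard library and returns the t statistic and degrees of freedom; p-values are not computed. The function name is hypothetical.

```python
import math

# Per-item error counts, in the item order of Table 4.6.
MONO_CORRECT = [0, 1, 2, 5, 3, 0, 0, 1, 2, 3]
MONO_INCORRECT = [0, 5, 2, 1, 2, 1, 3, 0, 1, 5]
POLY_CORRECT = [0, 0, 0, 1, 0, 1, 2, 1, 1, 0]
POLY_INCORRECT = [1, 3, 1, 5, 1, 0, 3, 1, 1, 1]


def paired_t(correct, incorrect):
    """Return (t, df) for a paired t-test on incorrect minus correct."""
    diffs = [i - c for c, i in zip(correct, incorrect)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1


for label, c, i in (("monosyllabic", MONO_CORRECT, MONO_INCORRECT),
                    ("polysyllabic", POLY_CORRECT, POLY_INCORRECT),
                    ("all items", MONO_CORRECT + POLY_CORRECT,
                     MONO_INCORRECT + POLY_INCORRECT)):
    t, df = paired_t(c, i)
    print(f"{label}: t({df}) = {t:.3f}")
```

Run on these counts, the sketch gives t(9) = 0.410 for monosyllabic items, t(9) = 2.400 for polysyllabic items and t(19) = 1.629 overall, in agreement with the values reported for errors.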
The number of errors made to each version of each item is shown in Table 4.6.
Monosyllabic                    Polysyllabic
Item    Correct   Incorrect     Item        Correct   Incorrect
Cat     0         0             Catamaran   0         1
Guard   1         5             Garden      0         3
Guide   2         2             Guidedog    0         1
Key     5         1             Keyhole     1         5
Pan     3         2             Panther     0         1
Pea     0         1             Peacock     1         0
Toad    0         3             Toadstool   2         3
Train   1         0             Trainer     1         1
Jug     2         1             Juggler     1         1
Nail    3         5             Nailfile    0         1
Total   17        20                        6         17
Table 4.6 Number of errors made in response to each target word
Figure 4.2 Response times (top panel) and number of errors (bottom panel) to true and false
propositions, pooled over subjects
Figure 4.3 Reaction times (top panel) and errors (bottom panel) to items with correct and incorrect
alignment, pooled over subjects
Figure 4.4 Reaction times (top panel) and errors (bottom panel) to mono- and polysyllabic items with
correct and incorrect alignment, pooled over subjects
4.4 Discussion
The results have shown that there are interesting effects of alignment and also effects of
foot type and truth-value. I will deal first with the final two of these results, which are
essentially side issues, before returning to the main issue of alignment in section 4.4.3.
4.4.1 Foot type
Firstly, monosyllabic words are responded to more quickly than polysyllabic words. This
is a typical result in experiments on spoken word recognition, at least when responses are
measured from target word onset rather than offset.
For example, Grosjean (1980)
investigated the effects of word length based on the number of syllables in a word. A
gating experiment was used in which successively longer portions of a word are presented
and after each presentation subjects are asked to say what they think the word is and to
indicate how confident they are about their answer. Results showed that polysyllabic
words were identified at larger gate sizes (i.e. when more of the word had been presented)
than monosyllabic words (although monosyllabic words were often not identified until
after their acoustic offset).
In addition, Craig and Kim (1990) investigated the effects of word duration per se. Again
using a gating procedure, they measured the average number of gates needed to identify
words of different durations. They found that longer words are not identified until more
of the word has been presented than is necessary to identify shorter words. Also, the
proportion of the word presented (and not just the total duration) has to be greater in order
for longer words to be identified. Shorter words, however, were often not identified with
a high confidence level, even at the final gate. As the present experiment also found that
responses to monosyllables were quicker than those to polysyllables, this is an indication
that the sentence-picture verification task used is sensitive.
Another factor, other than word length itself, that may affect reaction times to mono- and
polysyllabic words is frequency of occurrence. Generally in English, polysyllabic words
are of lower frequency of occurrence than monosyllabic words. A survey of the British
National Corpus (http://sara.natcorp.ox.ac.uk) reveals that this is the case for all the pairs
in this experiment except ‘guard’ and ‘garden’. These results, taken from unlemmatised
counts of both written and spoken material, can be seen below in Table 4.7 where the
number of occurrences per million words is shown for each item. It is very difficult to
control for frequency effects in an experiment of this type where structural detail is so
important and it would be almost impossible to find words of equal frequency that also
meet the other requirements for items.
Nevertheless, the different frequencies of
occurrence for mono- and polysyllabic words probably contribute to the differences in
recognition time.
Monosyllabic                    Polysyllabic
Item    Frequency per million   Item        Frequency per million
Cat     3801                    Catamaran   76
Guard   483                     Garden      9537
Guide   3844                    Guidedog    Unattested
Key     2375                    Keyhole     122
Pan     1183                    Panther     123
Pea     151                     Peacock     135
Toad    221                     Toadstool   26
Train   5929                    Trainer     971
Jug     547                     Juggler     24
Nail    494                     Nailfile    Unattested
Table 4.7 Frequencies for items in the experiment
4.4.2 True and false propositions
Reaction times to true propositions are quicker than those to false propositions. This is a
very common finding in sentence-verification tasks. Clark and Chase (1972) demonstrate
that in response to affirmative statements, like the ones used in this experiment, subjects
are quicker to respond correctly to a true than to a false utterance.
This extra
‘falsification time’ (Clark and Chase 1972: 483) is a robust finding in the literature. Clark
and Chase’s (1972) explanation for the effect is that subjects initially assume that all
sentences are true and will only decide they are false when there is contradictory
evidence. Thus, deciding an utterance is false adds an extra step to the process of
verification, as the subject must change the value of a truth index from true to false. In
turn this extra step adds a fixed amount of extra time to the subject’s reaction.
Clark and Chase (1972) find falsification time to be 142 ms when pictures are presented
before sentences. In the present experiment true utterances are verified in 1700 ms on
average whereas false utterances are verified in 1792 ms, a difference of 92 ms. This is
quite close to the Clark and Chase (1972) result and may be shorter because this task is
simpler, requiring the evaluation of only one noun rather than two in the earlier study
(where the nouns were always ‘star’ and ‘plus’). There may also be an effect of sentence
medium as Clark and Chase (1972) used written sentences as opposed to the spoken
sentences in this experiment. The presence of the falsification effect in this experiment
again indicates that the task does work and is an effective way to measure processing.
4.4.3 Alignment
Although the effect of alignment is not significant overall we have seen that an interesting
effect emerges when the two foot types are separated. Reaction times to monosyllabic
items are unaffected by alignment although there is a trend for responses to be quicker
with correct alignment. The situation is reversed for polysyllabic words, however: there is
a weakly significant trend for reaction times to be faster when polysyllabic words are
synthesised with the incorrect alignment. This result is contrary to that predicted and
seems at first sight to also be incompatible with the original hypothesis that items will be
processed more quickly with appropriate alignment.
However, if we think of the alignment differences in the stimuli in terms of ‘early’ and
‘late’ alignment rather than ‘correct’ and ‘incorrect’ the situation is clearer.
For polysyllabic items, incorrect alignment means alignment earlier in the syllable than would
occur naturally. Taking the pair ‘cat’ and ‘catamaran’ as an example, Table 4.2 shows
that the duration of ‘cat’ is 574 ms. Therefore in correct versions alignment at 34% of the
syllable puts EP at 195 ms whereas alignment in incorrect versions is later at 46% or 264
ms. In ‘catamaran’ the situation is reversed. The duration of the accented syllable is 287
ms and correct alignment puts EP at 46% or 132 ms. In incorrect versions however
alignment is earlier at 34% or 98 ms. It is plausible therefore that it is this earlier EP
rather than an appropriate EP alignment that speeds up reaction times.
The exact mechanisms of how an early EP might contribute to the process of recognition are
discussed below in section 4.4.4.3.
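The arithmetic behind these figures can be checked with a short sketch. The durations and percentages are those quoted above; the function name ep_time_ms and the rounding to the nearest millisecond are assumptions made purely for illustration and are not a description of how the stimuli were built.

```python
# Illustrative check of the EP timings quoted in the text. The function name
# and the rounding to the nearest millisecond are assumptions, not the
# procedure used to resynthesise the stimuli.

def ep_time_ms(syllable_duration_ms, alignment_pct):
    """Time of EP from syllable onset, rounded to the nearest millisecond."""
    return round(syllable_duration_ms * alignment_pct / 100)

EARLY_PCT = 34  # alignment that is correct for 'cat', incorrect for 'catamaran'
LATE_PCT = 46   # alignment that is correct for 'catamaran', incorrect for 'cat'

# Durations of the accented syllable, as quoted in the text.
durations_ms = {"cat": 574, "catamaran": 287}

for word, duration in durations_ms.items():
    early = ep_time_ms(duration, EARLY_PCT)
    late = ep_time_ms(duration, LATE_PCT)
    print(f"{word}: EP at {early} ms ({EARLY_PCT}%) or {late} ms ({LATE_PCT}%)")
```

For the polysyllabic item the incorrect (34%) version places EP some 34 ms earlier than the correct (46%) version, which is the 'early' alignment referred to above.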
The effects of alignment on errors also differ between mono- and polysyllabic items. An
exploration of the data for monosyllabic items alone reveals that the errors are somewhat
random as they are unaffected either by the alignment of EP or by the truth-value of the
proposition. This is an example of a speed-accuracy trade-off. Monosyllabic words, as
we have seen, are responded to more quickly than polysyllabic items, but are
correspondingly also more error prone. Comparing the open and filled circles in Figure
4.5 reveals that, for monosyllables, there is little difference in speed or number of errors
when different alignments are presented.
The story for polysyllabic items is rather different. In this case there are significantly
more errors when alignment is incorrect than when it is correct as a comparison of the
open and filled triangles in Figure 4.5 shows. We have already seen that responses to
polysyllabic words are quicker when the incorrect intonation is used and have postulated
that the earlier EP under these conditions may speed recognition. It is possible that this
early EP may also lead subjects to make errors caused by the erroneous belief that enough
of the word has been heard to make a correct response. However, this cannot always be
the case. Although subjects do make errors to polysyllabic items when the wrong
alignment is used, the majority of their responses are correct. This suggests that early
alignment facilitates speed of recognition, usually allowing for a correct judgement to be
made but on some occasions leading to errors.
Why would an early EP only sometimes lead to errors being made? Whether an error is
made or not may depend to a large extent on the durations of the mono- and polysyllabic
words in question and the alignment of the individual EPs. Imagine for example that
there is a big difference between the durations of the target syllable in mono- and
polysyllabic words. In the incorrect version of the polysyllabic word EP will occur early
when subjects may have heard very little of the syllable. In this case they might have
missed out on other important cues to word identity such as coarticulatory effects. When
the syllables in mono- and polysyllabic words are not so different in duration however,
the listener has indeed heard enough of the word and therefore an early EP allows for
quick and accurate word recognition. Monosyllabic words are not slowed down by a late
EP, however, because other cues have already allowed their recognition before EP is
heard.
Figure 4.5 Speed-accuracy trade-off (mean number of errors and reaction time for each condition,
pooled over subjects)
4.4.4 The role of prosody in spoken word recognition
It is important to discuss how these findings fit in with other research on spoken word
recognition. In section 4.4.4.1 I will describe three models of spoken word recognition
and discuss how these models do not include a prosodic component. Section 4.4.4.2 will
consider previous experimental evidence concerning the role of prosody in word
recognition whilst section 4.4.4.3 will consider how models might be modified to make
use of prosody in the light of the findings presented in this chapter.
4.4.4.1 Models of spoken word recognition
Spoken word recognition is generally assumed to be accomplished by activation and
competition of words in the mental lexicon. Models of spoken word recognition may be
distinguished in various ways such as the number of stages involved, the use of top-down
information and the methods used to identify word boundaries.
In the Cohort model (Marslen-Wilson and Welsh 1978), activation and competition are
considered to be separate stages. The activation stage involves setting up a word-initial
cohort based only on the first 150 ms of the word whilst the competition stage involves
the elimination of any words in this cohort that do not match further incoming sensory
information. The word that is recognised is the final word left in the cohort after the
others have been eliminated. Newer versions of the model (e.g. Marslen-Wilson 1993)
maintain the same principle but the elimination stage involves a gradual decay of
activation for non-matching items rather than complete elimination and words are
recognised when one reaches a particular threshold.
In both older and newer versions of the model words are recognised sequentially so that
the recognition of one word allows a new cohort to be set up for the next. The earlier
version allows for the use of top-down information at all stages whilst the later versions
restrict this use to a post-lexical stage of integration where the syntactic and semantic
properties of the word are accessed.
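The two stages of the earlier version of the model can be sketched in a few lines. This is a deliberately crude illustration and not an implementation of the Cohort model: letters stand in for the speech signal, the first two letters stand in for the first 150 ms of the word, and the toy lexicon and the function name cohort_recognise are invented for the example.

```python
# Toy sketch of activation followed by elimination. Assumptions: letters
# stand in for speech segments, onset_len letters stand in for the first
# 150 ms, and TOY_LEXICON is invented for illustration.

TOY_LEXICON = ["cat", "catamaran", "captain", "cattle", "dog"]

def cohort_recognise(segments, lexicon, onset_len=2):
    """Return (word, n_segments_heard) once a single candidate remains.

    Returns (None, len(segments)) if more than one candidate (or none)
    is still in the cohort when the input ends.
    """
    # Activation stage: set up the word-initial cohort from the onset only.
    onset = segments[:onset_len]
    cohort = {w for w in lexicon if w[:onset_len] == onset}

    # Competition stage: eliminate candidates that mismatch further input.
    for i in range(onset_len, len(segments) + 1):
        heard = segments[:i]
        cohort = {w for w in cohort if w.startswith(heard)}
        if len(cohort) == 1:
            return next(iter(cohort)), i
    return None, len(segments)

print(cohort_recognise("catamaran", TOY_LEXICON))
print(cohort_recognise("cat", TOY_LEXICON))
```

With this toy lexicon 'catamaran' is isolated after four segments, whereas 'cat' is never isolated because longer words that contain it remain in the cohort at its offset.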
Connectionist models such as TRACE (McClelland and Elman 1986) do not separate the
activation and competition stages of the process.
TRACE is a highly interactive connectionist network that makes heavy use of top-down information. TRACE consists
of three levels of nodes corresponding to phonological features, phonemes and words.
Activation of one node spreads to compatible nodes on other levels whilst inhibiting the
activation of nodes on the same level, so for example, the activation of one node on the
word level inhibits the activation of other word nodes but facilitates activation of
compatible phonemes and features. A word is recognised when there is stable activation
in the network and only one word node is activated. TRACE deals with the problem of
identifying word boundaries by treating every part of the speech stream as a potential
word onset and setting up a new network at every time-slice of the speech stream. In this
way, not only do words with the same initial phonemes compete with each other (as in the
Cohort model), but words beginning in different ways also compete.
The SHORTLIST model (Norris 1994) contains elements of both the Cohort model and
TRACE. Activation and competition are separate stages, with activation being a bottom-up process, which creates a shortlist of candidates in some ways like the initial stage of
the Cohort model. In this model, however, the shortlist contains candidates beginning at
every phoneme in the input rather than only words matching the first 150 ms of input.
The competition stage is reminiscent of the lexical level of TRACE, whereby a spreading
activation network allows the words in the shortlist to compete with each other using both
bottom-up and top-down information. Word boundaries are detected in the same way as
in TRACE (by treating every phoneme as a potential word onset) but, as activation and
competition are separate, without the extra computation needed to create a new network
at each time slice.
All three of these models of word recognition have little to say about the role of prosody.
This apparent deficiency is due to many factors.
Firstly, models of visual word recognition, in which prosody obviously does not play a part, have often influenced the
development of models of spoken word recognition. Secondly, models of spoken word
recognition often consider only the recognition of monosyllables, in which prosodic
information is limited. Thirdly, the input to these models is often rather unrealistic,
taking the form, for example, of a string of phonemes or phonological feature bundles.
This also makes it difficult to allow for the inclusion of prosodic information. Finally,
there is no real consensus on the role played by prosody in the recognition process. The
next section will discuss some experimental findings that highlight this debate.
4.4.4.2 Experimental studies of prosody in spoken word
recognition
The role suggested for prosody in spoken word recognition depends to some extent on the
language in question and also upon the particular model of word recognition or
theoretical framework under consideration. For example, one suggestion for the role of
prosody arose to address the deficiencies of sequential recognition models such as the
Cohort model (e.g. Marslen-Wilson and Welsh 1978). In the Cohort model words may,
in theory, be recognised before their acoustic offset, allowing very fast recognition. This
is possible if a word’s uniqueness point (the point at which it becomes different from any
other word in the lexicon) is before the acoustic offset, making it the only word remaining
in the cohort.
The evidence from distributional studies, however, suggests that this simple version of the
model will not work. Luce (1986) identified the uniqueness point for each of 20,000
words and found that it did indeed occur before the offset of the word in 60% of cases.
However, in English and other languages, short words are very often embedded in longer
words as pointed out by, for example, McQueen and Cutler (1992). So, in the present
experiment for example, ‘cat’ is embedded in ‘catamaran’, and so on. This embedding
means that the uniqueness point for short words like ‘cat’ is often after their offset. Short
words are also more frequent in the language than long words. When Luce (1986) took
frequency into account he found that the probability of a word’s uniqueness point being
before its offset was only 0.39.
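The notion of a uniqueness point, and the effect of embedding on it, can be made concrete with a small sketch. It is an illustration only: spelling stands in for a phonemic transcription, the toy lexicon is invented, and the function name uniqueness_point is hypothetical. It does not reproduce Luce's (1986) procedure or his frequency weighting.

```python
# Toy illustration of the uniqueness point. Assumptions: spelling stands in
# for a phonemic transcription and TOY_LEXICON is invented for the example.

TOY_LEXICON = ["cat", "catamaran", "captain", "cattle", "dog"]

def uniqueness_point(word, lexicon):
    """Return the 1-based position at which `word` diverges from every
    other word in `lexicon`, or None if it is still ambiguous at its
    offset (that is, if it is embedded at the start of a longer word)."""
    others = [w for w in lexicon if w != word]
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if not any(w.startswith(prefix) for w in others):
            return i
    return None

for w in TOY_LEXICON:
    print(w, uniqueness_point(w, TOY_LEXICON))
```

In this toy lexicon 'catamaran' becomes unique at its fourth segment, well before its offset, whereas 'cat' has no uniqueness point within the word because it is embedded in 'catamaran' and 'cattle'.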
It is in this respect that prosody has been considered to be important in the recognition
process. If we consider that at the activation stage the listener is sensitive not only to the
segmental phonology but also to the prosodic structure of the utterance then it is likely
that there will be more success in identifying the correct word quickly. As the number of
words in the lexicon matching both the segmental phonology and the prosody of the input
will be small, there will be very few words considered at the competition stage, allowing
the correct word to be isolated quickly.
Work by Winfield et al. (1997) indicates that this explanation may be tenable. In a gating
experiment, where successively larger portions of a word are played, subjects recognised
words at gate sizes where the cohort size was still very large if only segmental phonology
was considered. When the cohorts were re-estimated to take prosody into account,
however, they were actually much smaller and could potentially explain the subjects’
ability to recognise words so early. One difficulty in interpreting the result from this
experiment however is that prosody is taken to be a whole collection of parameters
including vowel quality, intonation, amplitude and duration.
Work by Cutler and colleagues (e.g. Cutler and Norris 1988) suggests that vowel quality
information is crucially important in the recognition process. Cutler and Norris’ (1988)
Metrical Segmentation Strategy suggests that vowel quality is used to explicitly segment
the speech stream. This work is based on the observation that most words in English
have a strong first syllable (where vowels are unreduced and the syllable can potentially
carry stress). Therefore, strong syllables could be a good basis on which to initiate lexical
access as the listener can assume that a strong syllable in the input is probably word-initial.
This view has been largely supported by experimental work. Cutler and
Butterfield (1992), for example, induced misperceptions by playing subjects
unpredictable sentences at low intensities. They found that word boundary insertions in
the misperceptions were more common before strong syllables whereas word boundary
deletions were more common before weak syllables.
For example, the sentence “conDUCT asCENTS upHILL” (where capitals indicate a strong syllable) might be
(mis)perceived as “the DOCtor SENDS her BILL” (Cutler and Butterfield 1992: 228).
Spontaneous mishearings in which word boundaries are inserted before strong syllables,
for example perceiving “a meCHANical HORSE” as “I’m a CANNibal HORSE”
(author’s unpublished corpus), are common. Statistical analyses of such corpora (e.g.
Cutler and Butterfield 1992) also support the Metrical Segmentation Strategy.
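The core idea of the strategy, that a word boundary is posited before every strong syllable, can be sketched as follows. This is a simplification offered for illustration: the syllabification of the example sentence is assumed, strong syllables are marked by capitals as in the examples above, and the function name mss_segment is hypothetical. The sketch says nothing about how lexical access then proceeds from each boundary.

```python
# Minimal sketch of boundary placement under the Metrical Segmentation
# Strategy. Assumptions: syllables are given as a list, and a syllable
# written entirely in capitals is strong (the convention used in the text).

def mss_segment(syllables):
    """Group syllables into chunks, starting a new chunk at every strong
    syllable. Any weak syllables before the first strong one form their
    own initial chunk."""
    chunks = []
    for syl in syllables:
        if syl.isupper() or not chunks:
            chunks.append([syl])      # boundary posited before this syllable
        else:
            chunks[-1].append(syl)    # weak syllable attaches to the left
    return chunks

print(mss_segment(["con", "DUCT", "as", "CENTS", "up", "HILL"]))
```

For “conDUCT asCENTS upHILL” the sketch groups the syllables as con | DUCT as | CENTS up | HILL, which places boundaries before the same strong syllables as the reported misperception “the DOCtor SENDS her BILL”.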
Cutler (1986) goes further, stating that it is only vowel quality, and not other features of
stress, that affects word recognition in English. For example, a cross-modal associate
priming experiment demonstrates that presentation of either member of a minimal stress
pair (such as ‘forBEAR’ and ‘FORbear’) primes words associated with both members.
Cutler (1986) suggests that the suprasegmental differences between such pairs are not
able to constrain lexical activation and as such these pairs are homophonous.
This under-use of suprasegmental cues for word recognition appears not to be found in
other languages. Cutler and van Donselaar (2001) investigate whether pairs such as
‘VOORnaam’ (first name) and ‘voorNAAM’ (respectable) are phonetically equivalent from
the point of view of lexical access (or ‘homophonous’ in Cutler’s terms), in Dutch. In
one experiment, an auditory lexical decision task is used to show that one member of this
minimal pair does not facilitate responses to the other (unlike in English) although
repetition priming (where one member of the pair facilitates recognition of itself) is
reliably found. This suggests that suprasegmental cues are used in the recognition
process in Dutch.
Why should it be the case that Dutch listeners use suprasegmental cues and English
listeners do not? Dutch and English are very similar in that both are stress-timed, have an
opposition between strong and weak syllables and have lexical stress. Cutler and
colleagues (e.g. Cutler et al. 1997) suggest that the different findings for different
languages are related to differences in word prosody. In English, there is a close
correspondence between vowel quality and stress, in that stressed syllables always
contain full vowels whilst unstressed syllables usually contain reduced vowels. This is
not the case in many other languages, such as Dutch, where it is common to have full
vowels in unstressed syllables. Cutler et al. (1997) suggest that in English, stress is
unambiguously signalled by vowel quality, and therefore listeners are not advantaged by
paying attention to other, suprasegmental, correlates of stress. In addition, information
about word identity from segments may be available earlier than information from other
sources. Cooper et al. (2002: 209) discuss how cues found within vowels already indicate
the identity of the following consonant and can be used for word recognition.
The view that English listeners do not use suprasegmental information for word
recognition, preferring to rely on the cues to stress provided by vowel quality, has been
rather pervasive in the literature. Nevertheless, an experiment by Lindfield et al. (1999)
does attempt to assess the contribution of suprasegmental aspects of stress to word
recognition in English. Using a similar method to Winfield et al. (1997), three different gating
experiments were conducted. The first was a standard gating experiment where the onset
of a word was presented in increasingly larger fragments. In the second condition, the
remainder of the word after the gate was filled in with white noise, informing the subject
about the duration of the word. In the third condition, the remainder of the word was
filled in by band-pass noise, informing the subject about the duration of the word and also
the number and stress of the syllables. Results showed that subjects recognised words at
smaller gates in the third condition than in either of the other two conditions whilst there
was no significant difference between recognition points in the first two conditions. The
authors interpret this result as showing that prosody (and seemingly only syllabic stress)
is used to constrain the word-initial cohort.
Lindfield et al.’s (1999) result does not sit well with the more generally accepted view put
forward by, for example, Cutler et al. (1997) that suprasegmental correlates of stress are
not useful to English listeners. Recently however, it has again been suggested that the
role of these suprasegmental correlates in English may have been underestimated.
Cooper et al. (2002) demonstrate that in a cross-modal priming task, English listeners’
responses are facilitated by primes that are stress-matched. This is the case for both
one- and two-syllable primes. Thus, ‘ADmi-’ primes ‘ADmiral’ more than it primes
‘admiRATion’ and ‘MUS-’ primes ‘MUSic’ more than it primes ‘muSEum’.
Cooper et al.’s (2002) results are further evidence that English listeners can indeed
exploit suprasegmental information in word recognition. There is still, however, evidence
that English listeners find this task more difficult than, say, Dutch listeners. It seems that
English listeners can use information from a two-syllable prime more effectively than
from a one-syllable prime. In the one-syllable case there is also facilitation from a stress-mismatched (but segmentally identical) prime whilst this is not the case for the disyllabic
primes (Cooper et al. 2002). Dutch listeners only show a priming effect for primes that
match in both stress and segmental identity.
So far none of these results have much to say about the issue of intonation in word
recognition other than the role it might play as part of a more general stress parameter.
Recently however, event-related potentials (ERPs) have been used to try to illuminate this
issue.
Boecker et al. (1999) identified a negative-going ERP 325 ms after word onset (N325) for
words that contain a reduced vowel in German. N325 is not found when only a
difference in stress is present (Friedrich et al. 2000), suggesting that vowel quality and
other elements of prosody do indeed play different roles in speech processing.
Friedrich et al. (2001) study the pitch contour as a single correlate of stress where a
higher, rising F0 is associated with the stressed syllable. For German listeners they report
a positive potential 200 ms after word onset (P2), which is larger when the first syllable
of a disyllabic word is unstressed than when it is stressed. Friedrich et al. (2002) report
that not only do stressed primes facilitate recognition of words with stressed first syllables
(regardless of segmental identity) but that there is also a larger positive potential 350 ms
after word onset (P350) when there is a stress mismatch.
Results from these ERP studies are interesting, as they suggest that the pitch contour is
extracted early on in the process of word recognition. Nevertheless, as we have seen,
speakers of different languages may rely on different aspects of prosody to different
extents, so it is not appropriate to extend these results from German to English listeners
without further investigation.
4.4.4.3 A proposed role for prosody in models of spoken word
recognition
It seems, given the experimental evidence discussed above and the results from the work
presented in this chapter, that prosody does indeed play some role in spoken word
recognition. This section will consider how prosody might affect recognition in relation
to the three models described above and what exactly the role of EP alignment might be.
There are two obvious ways in which prosody might be used in spoken word recognition.
In the first alternative prosody might operate at the activation stage of the process
influencing which words are considered for recognition.
This would have two implications for models. Firstly, the input to models would have to include prosodic
information rather than just a string of phonemes or features. This is easily accomplished
in the Cohort model where prosodic information as well as segmental information in the
first 150 ms of input could be considered. In SHORTLIST and TRACE the problem is
more difficult to solve, as the inputs to these models are phonemes in SHORTLIST and
phonological features in TRACE. A general solution is for a more realistic input or for
additional channels or layers of nodes dealing specifically with prosody. The second
implication of this approach is that the mental representation of a word must also have
access to prosodic information. This information may form part of the entry for the word
in the mental lexicon or, more likely, may exist as a set of rules that allow predictable
aspects of prosody (such as the alignment of EP) to be established from phonological
structure. The prosodic information, in whichever form it takes, will allow the word to be
activated by any congruent information in the input.
The second alternative is that prosody might be used at the competition stage. So, words
might be initially activated on the basis of segmental phonology alone but then might
receive more activation if they match the prosody of the input as well. McQueen et al.
(1994: 235) suggest that the Metrical Segmentation Strategy could be incorporated into
the SHORTLIST model in this way. The Metrical Segmentation Strategy would act as a
bias in the competition process causing words with strong first syllables to receive more
activation than those with weak first syllables. In this suggestion the input to models
would not need to be enriched as words with strong first syllables receive extra activation
regardless of the strength of syllables in the input. Lexical entries, however, need to be
specified for syllable strength. A different version of this suggestion might be similar to
that given in the first alternative above, namely that matching prosody biases the
competition process by virtue of both richer lexical entries (or sets of rules) and a richer
input.
The results found for EP alignment, whilst suggesting that it is indeed used in spoken
word recognition, do not seem to be accounted for by either of these two alternative
explanations. It seems unlikely that EP alignment is used to constrain the items initially
activated and considered for recognition. If this were the case, the number of items
considered would potentially be very small. Depending on the fine-grainedness of the
distinctions represented in the input and lexicon, there may be only one word activated
and put forward to the ‘competition’ stage. This should mean that subjects make very fast
decisions regardless of whether alignment is correct or incorrect as there are so few items
left at the competition stage. It should also mean that subjects are always correct with
correct alignment but always make errors with incorrect alignment. As we have seen
however, this is not the case. Although subjects do make more errors when alignment is
incorrect, the number of errors is still very small. Conversely, some errors are made when
alignment is correct. These results suggest that alignment does not constrain the initial
activation of items, which is not surprising given the rather variable alignment of EP
between speakers. This theory would be particularly unworkable for the Cohort model as
EP is not reliably found within the first 150 ms of the word, which is considered to be
used as the basis for activation.
It is also unlikely that EP acts as a bias at the competition stage by virtue of being
represented in lexical entries (or by rules that assign its alignment on the basis of
structure). If this hypothesis were correct, items with matching alignment would be
activated more than items with non-matching alignment and responses to these more
highly activated items would be quicker. As we have seen, however, there is no overall
effect of alignment. Even when responses to mono- and polysyllabic items are separated
there is no effect of alignment for monosyllables, and for polysyllables responses are
actually quicker with incorrect alignment. This suggests that EP in the input is not
matched to EP in lexical entries at the competition stage.
A final alternative might be that the intonation component, studied here with reference to
EP alignment, does not form part of the lexical entry for individual words, but instead is
processed separately as a guide to the proximity of the end of the word or phrase. This
process could work in two possible ways. In the first alternative, when a person hears EP,
they work out how many syllables are left in the word and facilitate the activation of
items with that number of syllables at the competition stage. The distinction is likely to
be between mono- and polysyllabic words as these were found to be significantly
different in speakers’ productions in Chapter 2. Alternatively they might use EP as a
signal to immediately move on to the next stage in the process and choose to recognise
the word with the highest activation at that point in time. In real conditions these
strategies would generally be reliable but in this experiment they may occasionally lead
the listener to make errors because, in fact, not enough of the word has been heard to
make the correct decision.
4.4.4.4 Implications for future research
There are a number of issues to consider here. One is that EP alignment varies between
speakers. If EP triggers immediate lexical access then this is unproblematic as the
listener will respond in the same way regardless of the speaker’s particular alignment. If,
however, EP does signal how many syllables are left in the word the listener will need
knowledge of any particular speaker’s alignment patterns before this information can be
used reliably. This knowledge could take the form of statistical regularities derived from
a small sample of speech or of a type of episodic memory.
Another issue is that the sentence-verification task used here may not be sensitive enough
to have revealed everything about the role of EP in spoken word recognition.
In particular, it tells us little about the time course of its usage. There are several more
traditional methods that might help to distinguish when exactly EP is used. For example
the gating and cross-modal priming tasks might be appropriate. Gating would provide
information about the candidates generated before and after EP is heard and the
recognition time in relation to EP. Cross-modal fragment priming would show whether
relatives of both mono- and polysyllabic words are facilitated if a syllable is synthesised
with EP aligned appropriately for either one or the other.
Another problem is that only EP alignment is manipulated in this experiment whereas in
the natural data monosyllabic words and the segmentally identical initial syllables of
polysyllabic words are also distinguished by factors such as duration (shorter in
polysyllables) and the rate of change of the fall (generally shallower in polysyllables).
The results gained by manipulating EP alone suggest that EP itself does play a role in
word recognition but a full picture cannot emerge until the roles of these other factors
have been identified. It is likely that the inclusion of additional distinguishing prosodic
variables will further support the case for the role of detailed prosodic information in
word recognition.
An additional problem is that the plateau has only been studied for nuclear falls in broad
focus declaratives. Therefore it is difficult to know what role alignment might play in
prenuclear accents or when the speaker uses a different intonation pattern. It could in fact
be the case that EP is a phrasal property marking the distance until the end of the
utterance rather than the end of the word or foot; in this case it would only play a role in
nuclear position. This proposal is unlikely however because post-nuclear tails can be
very long.
A final problem is that SP was also manipulated in this experiment as it occurs at a fixed
time before EP. Therefore we are not able to conclusively rule out SP as an important
factor. However, the particular contrasts used are based on the observed alignment of EP
and it is likely therefore that the results reflect this. To be certain, however, more
experiments are needed where SP and EP (and probably the peak too) are independently
manipulated.
4.5 Conclusions
It has been demonstrated that EP alignment does affect the recognition of words.
Specifically, polysyllabic words are recognised more quickly but less accurately when
they are resynthesised with an earlier, incorrect alignment. This is an interesting result as
only EP was manipulated although a host of other variables differ in the detailed prosodic
structure of mono- and polysyllabic words.
It seems that EP does mark linguistic structure but probably not by virtue of forming part of the lexical entry for a word. More
likely is that EP acts as a factor additional to the segmental phonology of the word either
by triggering immediate lexical access or by allowing words with the correct number of
syllables to be favoured at the competition stage.