Chapter 4

The role of the plateau in spoken word recognition

4.1 Introduction

Results from the previous production experiments, discussed in Chapters 2 and 3, suggest that the end of the plateau (EP) may mark linguistic structure. The alignment of this point within the accented syllable covaries with foot structure in the same way for all speakers but is unaffected by the non-structural factors pitch span and utterance type. It is conceivable that speakers use EP to signal aspects of linguistic structure such as whether or not there are more syllables to come in the word or foot. However, although it seems that EP covaries with linguistic structure in production, it is not known whether its alignment facilitates the listener’s task in any way. Thus, the aim of the present experiment is to see whether listeners can make use of the alignment of EP in speech processing. As EP alignment covaries with the number of syllables in the foot, one conceivable hypothesis is that the listener may attend to the alignment of EP within the syllable and use this information in the process of spoken word recognition.

In order to ascertain whether EP does indeed mark linguistic structure and facilitate processing, we need to be able to manipulate experimentally the alignment of this point within the syllable whilst holding other factors, such as the length of the preceding plateau and the rate of change of the fall, constant. The technical details of such manipulations are made easy by resynthesis, but it is notoriously difficult to design experiments that evaluate the function of intonational contrasts. There are many issues to consider in the design of both the stimuli and the task.

4.1.1 Issues in designing stimuli

On the face of things it may seem very easy to design stimuli for experiments concerning intonation.
For example, one could compare responses to stimuli sensitive to the feature under test and stimuli insensitive to this feature. So for example we could compare one set of stimuli with EP alignment modelled on natural speech, and thus sensitive to prosodic structure, to another set with a random alignment of EP. This approach is doubly problematic, however. Firstly, it is difficult to interpret the results from such an experiment. Any significant results could be attributed only to the fact that listeners are distinguishing between natural and unnatural patterns in the stimuli rather than having their perception directly affected by the experimental factor in question. So, for example, the set of stimuli with EP sensitive to prosodic structure may be processed more efficiently merely because it is more natural and not because the alignment of EP facilitates processing. Secondly, even when such results can be considered to be reliable, they reveal very little about how the perceptual system processes natural speech, as the random stimuli have been created using unnatural contrasts. Another general difficulty encountered when designing stimuli for perceptual experiments concerning intonation is that there is, of course, no one-to-one match between intonation contours and meaning. Thus if intonation patterns are changed this may create a different, and yet still perfectly acceptable, natural intonation contour. In addition, intonation signals many different functions in English. For example it gives information about grammatical structure, sentence type and focus as well as the attitudes and emotions of the speaker. Therefore it is crucially important to resynthesise stimuli in such a way that they will not sound unnatural or change the meaning of the utterance. 
As the point of interest here is the alignment differences between EP in monosyllabic and polysyllabic feet, it will be interesting to see what happens to processing when this alignment is changed. In order not to create stimuli that may lead to uninterpretable results, one solution is to investigate how processing is affected when two naturally attested patterns are exchanged. Therefore, in this experiment processing will be examined when the syllable in a monosyllabic foot is synthesised with EP aligned appropriately for the accented syllable in a polysyllabic foot, and vice versa. These stimuli are similar in type to those used by Ogden et al. (2000) (as described in Chapter 1), where monosyllabic words were synthesised with correct or incorrect alignment. In this experiment, however, stimuli will contain both mono- and polysyllabic words.

One important issue to consider is whether the duration of the syllable should be a factor in the present experiment. As we have seen, alignment differences are in part due to the different durations of the accented syllable in mono- and polysyllabic feet. For this experiment, a decision was made not to include duration as a factor, as a significant result found for stimuli containing altered alignment alone would show that alignment itself is important regardless of the duration of the accented syllable.

4.1.2 Issues in designing the task

It is common when testing intonation for speech synthesis systems to use subjective measures. For example, in Mean Opinion Score (MOS) tests, listeners are asked to rate (usually on a five-point scale) whether the intonation is acceptable or natural. This is a perfectly reasonable method for speech synthesis systems where the main measure of quality is indeed the opinions of clients.
If we want to evaluate the function of intonation, however, it is important to use objective tasks that will tap directly into the particular cognitive processes that we believe a contrast may affect (cf. Duffy and Pisoni 1992).

As it is suggested that EP is a marker of low-level linguistic structure, indicating differences in word or foot length, it is important to employ a task that will tap into processing at that particular level. Many methods are currently used for examining the processes of spoken word recognition, such as gating, priming and word spotting. A simplified version of a sentence-picture verification task was chosen for use in this experiment.

Clark and Chase (1972) report a series of experiments where subjects compare a written sentence with a picture. Pictures in these experiments were always simple representations containing two geometric symbols, one above the other. Subjects made truth judgments to affirmative (‘star is above plus’) and negative (‘star isn’t above plus’) statements and their reaction times and errors were recorded. Results from these experiments led to a theory of sentence-picture verification that states that both elements (sentence and picture) must be represented in the same form and then compared to each other in order for the task to be completed. Each step of the process takes a fixed time and therefore extra operations add more time to response latencies. As in all reaction time experiments, more difficult tasks are indicated by longer latencies and a higher error rate.

The sentence-picture verification task is suitable for a number of reasons. Firstly, it allows for the measurement of both reaction time and errors, both of which are measures of processing efficiency. In addition, it allows for the presentation of a whole sentence rather than a single word, which is important as EP alignment must be considered as part of a larger prosodic hierarchy.
Also, as much is known about the results that can be expected from this task with non-manipulated stimuli, it will be easy to tell whether the test is sensitive or not when used with manipulated stimuli. Finally, the task is simple for subjects to complete and generally leads to a relatively low error rate, so that any errors that are made are interpretable.

4.1.3 Hypothesis

The hypothesis is that an appropriate alignment of EP will lead to faster reaction times and fewer errors than an inappropriate alignment because processing will be easier. This hypothesis entails several assumptions about the processes involved in recognition. The first is that subjects will process the intonation contour at an adequate level of detail and will be able to use the information. The second is that information about alignment forms a part of the mental representation of the word and can thus be matched or checked against alignment information in the signal. It is unlikely that EP alignment actually forms part of a dictionary entry for a word, as its alignment is predictable from structure, so the mental representation may take the form of a dictionary entry plus rules for the alignment of EP depending on the structure of the word. These issues will be discussed in further detail in section 4.4.4.2, which deals with previous experimental studies of the use of intonation in word recognition, and section 4.4.4.3, which considers the incorporation of prosody into current models of word recognition.

4.2 Method

4.2.1 Material

4.2.1.1 Auditory material

4.2.1.1.1 Items

Ten words (with varied onset and coda types) were chosen which are entire monosyllabic words in their own right as well as forming the first stressed syllable of a polysyllabic word. For example, the monosyllabic word ‘cat’ also forms the first stressed syllable of the polysyllabic word ‘catamaran’.
All the words chosen are imageable nouns, as the sentence-picture verification task requires that they be pictorially represented. The items are shown in Table 4.1.

Monosyllabic    Polysyllabic
Cat             Catamaran
Guard           Garden
Guide           Guidedog
Key             Keyhole
Pan             Panther
Pea             Peacock
Toad            Toadstool
Train           Trainer
Jug             Juggler
Nail            Nailfile

Table 4.1 Items used in the experiment

4.2.1.1.1.1 Recording and measurement of natural utterances

A female native speaker of SBE recorded each of the target words in the carrier sentence ‘It’s a picture of a ____’ with falling intonation on the target word. The recordings were made in a sound-treated room using Cool Edit and a high-quality microphone. Measurements were taken from these utterances to ensure that they followed the pattern found previously, specifically that the alignment of EP within the accented syllable was later in polysyllabic than monosyllabic words. EP was identified using methods described for previous production experiments. The duration of the accented syllable was measured and alignment calculated as a percentage of syllable duration from the beginning of the syllable, again using the same method as in previous production experiments. The rate of change (in Hertz per second) was determined by identifying the low tone (L), measuring the difference in Hertz between EP and L, and dividing this distance by the difference in seconds between the two points. A schematic representation of these measurements can be seen in Figure 4.1, and the results of these measurements are given in Table 4.2.

In general the patterns found mirror those from the production experiments in Chapters 2 and 3; EP is aligned later in a shorter syllable in polysyllabic words. There is also a trend for the rate of change in the fall to be greater in the monosyllabic word.
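The two derived measures described above (EP alignment as a percentage of syllable duration, and rate of change of the fall in Hz/s) can be sketched as follows. This is a minimal illustration only: the function names and the example time and frequency values are hypothetical and are not taken from the recordings.

```python
def ep_alignment_percent(syllable_onset_s, syllable_offset_s, ep_time_s):
    """EP alignment as a percentage of syllable duration, measured from syllable onset."""
    duration = syllable_offset_s - syllable_onset_s
    if duration <= 0:
        raise ValueError("syllable offset must follow syllable onset")
    return 100.0 * (ep_time_s - syllable_onset_s) / duration


def rate_of_change_hz_per_s(ep_time_s, ep_hz, low_time_s, low_hz):
    """Rate of change of the fall: the drop in Hz from EP to L divided by the time between them."""
    elapsed = low_time_s - ep_time_s
    if elapsed <= 0:
        raise ValueError("the low tone must follow EP")
    return (ep_hz - low_hz) / elapsed


if __name__ == "__main__":
    # Hypothetical values: a 574 ms syllable starting at 1.000 s, with EP at 1.195 s (220 Hz)
    # and the low tone at 1.495 s (110 Hz).
    print(ep_alignment_percent(1.000, 1.574, 1.195))
    print(rate_of_change_hz_per_s(1.195, 220.0, 1.495, 110.0))
```

Expressing alignment relative to syllable duration, rather than as an absolute time, is what allows the values for syllables of very different lengths to be compared directly.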
Monosyllabic
Item        Syllable duration (ms)    EP alignment (% of syllable)    Rate of change (Hz/s)
Cat         574                       34                              367
Guard       484                       30                              185
Guide       524                       20                              203
Key         472                       45                              384
Pan         551                       31                              188
Pea         448                       40                              516
Toad        556                       29                              328
Train       610                       32                              260
Jug         566                       22                              369
Nail        403                       42                              267

Polysyllabic
Item        Syllable duration (ms)    EP alignment (% of syllable)    Rate of change (Hz/s)
Catamaran   287                       46                              181
Garden      373                       34                              112
Guidedog    323                       49                              124
Keyhole     289                       63                              491
Panther     370                       46                              447
Peacock     235                       83                              279
Toadstool   295                       45                              120
Trainer     356                       53                              281
Juggler     247                       36                              317
Nailfile    314                       77                              198

Table 4.2 Measurements taken from natural utterances

[Figure: schematic pitch contour with frequency (Hz) against time (ms), marking EP, L and the syllable duration (ms)]

Figure 4.1 Schematic representation of measurements taken

4.2.1.1.1.2 Resynthesis

All resynthesis was carried out in PRAAT. A single carrier sentence was resynthesised by specifying turning points at seven points in the intonation contour. Within each target word the pitch contour was specified at four points and was then spliced onto the resynthesised version of the carrier sentence.

As discussed above in section 4.1.1, the design of the experiment involves swapping the alignment of EP between mono- and polysyllabic words. To this end two versions of each target word were created. The alignment of EP in each case was determined by whether the utterance was to have correct or incorrect alignment. So, in a correct version of ‘cat’ EP is aligned naturally for ‘cat’ whilst in an incorrect version EP is aligned at the percentage of the syllable found in ‘catamaran’. The situation is reversed for correct and incorrect versions of ‘catamaran’. The frequency of EP was always the same as in the natural utterance, regardless of whether a correct or incorrect version was created. In order to create a plateau another pitch point, representing SP, was added 70 ms (the average plateau duration for this speaker) before EP at the same frequency.
A further pitch point was added 80 ms before SP to create the rise to the plateau. The alignment and frequency of the turning point approximating the low tone were specified so as to maintain the rate of change found for the natural utterance. So, for example, ‘cat’ always had the natural rate of change for ‘cat’ regardless of whether the EP was located for a correct or an incorrect utterance. These factors are shown below in Table 4.3. Crucially, the incorrect version of each item differs from the correct version only in the alignment of the plateau (based on EP) and not in terms of any other potentially distinguishing factor.

                        Correct    Incorrect
Duration of syllable    Cat        Cat
EP Alignment            Cat        Catamaran
EP Frequency            Cat        Cat
Rate of Change          Cat        Cat
Low Tone Frequency      Cat        Cat

Table 4.3 Factors for correct and incorrect versions of ‘cat’

4.2.1.1.2 Fillers

Ten pairs of fillers were chosen; these are shown in Table 4.4. Every filler was an imageable noun. The members of a pair were either phonologically or semantically related. Although not modified in any way, fillers were all resynthesised so they would sound consistent with the test items and were then spliced onto the same carrier phrase used for test items.

Burger          Burglar
Gibbon          Ribbon
Television      Telephone
Table           Chair
Pen             Pin
Ring            Herring
Apple           Pineapple
Saw             Jigsaw
Bow             Elbow
Tank            Fishtank

Table 4.4 Fillers used in the experiment

4.2.1.2 Picture material

A selection of pictures was chosen that were considered to be good visual examples of each target word. In an informal rating experiment ten native speakers (different from those in the main experiment) rated these pictures on the degree to which each was a typical exemplar of the target word. The picture rated highest for each word was then given to ten different subjects (again different from those in the main experiment) who were asked to provide a label.
Pictures were chosen for use in the final experiment only if they were labelled correctly by seven of these subjects.

4.2.2 Subjects

Forty subjects were tested. All were monolingual speakers of British English residing in Cambridge at the time of the experiment. None reported hearing or speech problems and none were epileptic. Their ages ranged from 18 to 50 (mean 30). Thirty-eight were right-handed and two were left-handed. Subjects were paid a small fee for their time.

4.2.3 Training

4.2.3.1 Auditory training

Previous production experiments have shown that although all speakers align EP later in the accented syllable in polysyllabic than monosyllabic feet, the exact percentages are not the same for each speaker. For this reason, subjects were given auditory training designed to expose them to this particular speaker’s voice and acclimatise them to the particular patterns used to distinguish mono- and polysyllabic words without exposing them to the target items themselves. Smith (2003) indicates that listeners need only a small amount of training to become aware of linguistically-relevant talker characteristics and thus the length of the auditory training was approximately three minutes. Subjects listened to a fairy story (reproduced in Appendix C), read by the speaker of the test sentences, over headphones. The story contained ten pairs of words similar to those used in the main experiment, in that one member was a monosyllabic word and the other had that word embedded as the first strong syllable of a polysyllabic word. Members of each pair were far apart in the story to minimise the likelihood of subjects being alerted to the purpose of the experiment.

4.2.3.2 Picture training

It was important to ensure that any differences in subjects’ responses were only due to the auditory component and were not affected by factors associated with the pictures.
This had been partially controlled for by taking care to select the best picture to represent each target word as described in section 4.2.1.2. However, it was still possible that some pictures were better exemplars than others and also that different subjects might have different prototypes for the same concept, which could cause unwanted differences in subject responses. For this reason, subjects were trained to associate each picture with the correct target word or filler. In this training session they were shown a picture on a computer screen followed by a one-word label for that picture. They were shown each target picture and label three times (60 in all) and also saw the same number of filler items. Target and filler items were randomised together, with a different randomisation being created for each presentation for each subject.

Next, subjects were shown the same pictures (again in a different random order) without the label and asked to tell the researcher the label they remembered. If any mistakes were made they were shown each picture and label again before being retested on their recall. This procedure was repeated until they could remember all the labels. Most subjects could remember all the labels after the initial three presentations.

4.2.4 Setup and instructions

Subjects were tested individually in a sound-treated room. Training and testing took place in a single session and all parts of the test were run using DMDX (http://www.u.arizona.edu/~kforster/dmdx/dmdx.htm) on a PC. Picture stimuli were shown centralised on the screen and utterances were heard over headphones. The light in the room was dimmed so it did not reflect onto the screen. Subjects were asked to sit close to the desk so that they were comfortable.
Each version of each target word was heard once paired with the target picture (to elicit a true response) and once paired with the picture representing the opposite member of the pair (to elicit a false response). Items and fillers were pseudo-randomised together with the condition that no picture or utterance occurred in two successive presentations. Table 4.5 shows the different presentations of each target word.

Target Word    Synthesis Version    Picture      Truth-value
Cat            Correct              Cat          True
Cat            Correct              Catamaran    False
Cat            Incorrect            Cat          True
Cat            Incorrect            Catamaran    False

Table 4.5 The four presentations of each target item

The left and right shift keys of the keyboard were labelled T (true) and F (false). The same labels were placed on either side of the screen at eye-level. Subjects were instructed to respond ‘true’ if the target word matched the picture and to respond ‘false’ if it did not. Half of the right-handed subjects (16 in total) and half of the left-handed subjects (1 subject) responded ‘true’ with their right hand. The other subjects responded ‘true’ with their left hand. Subjects were instructed to keep their index fingers resting on the appropriate keys at all times and to respond as quickly and as accurately as possible. There was a break after every 30 responses and subjects could start the experiment again in their own time.

Subjects were given a short practice test before the main experiment in order to familiarise them with the format of the experiment. In this practice they saw each of the filler items twice, once paired with the appropriate picture and once paired with the picture for the other member of the pair. On average subjects took about 20 minutes to complete the experiment and 30 minutes to complete the training.

4.3 Results

4.3.1 Statistical procedure

Reaction times for correct responses were measured from the beginning of the target syllable and errors were counted.
Responses that took longer than three seconds were considered to be non-responses. A log transform was applied to reaction times in order to normalise the data, enabling the use of parametric statistical procedures. For both reaction times and errors, means were calculated for each combination of the variables foot type (monosyllabic or polysyllabic), alignment (correct or incorrect) and truth-value (true or false), giving eight dependent variables for input into a repeated measures (MANOVA) analysis. Each dependent variable was created from all ten relevant responses from each subject unless data points were missing (due to the subject not responding, responding too slowly or responding incorrectly), in which case means were calculated only on the basis of the remaining data points.

4.3.2 By subject

4.3.2.1 Reaction times

Response times to monosyllabic items are significantly faster than those to polysyllabic items (F(1,39) = 2.833, p=0.05) as shown in the upper panel of Figure 4.4. Response times to true utterances are significantly faster than those to false utterances (F(1,39) = 58.00, p<0.01) as shown in the upper panel of Figure 4.2. There is no significant effect of alignment (F(1,39) = 0.3635, p>0.05) as shown in the upper panel of Figure 4.3. There was a significant interaction between alignment and truth (F(1,39) = 3.22, p<0.05). This was further analysed by means of t-tests, which indicated that responses to true items were quicker with incorrect than correct alignment (t(39) = 1.715, p<0.05).

The difference in reaction times between mono- and polysyllabic items implies that these should be analysed separately. To this end two further repeated measures analyses were conducted, one for monosyllabic and one for polysyllabic items. Each analysis had factors of truth and alignment. Results for alignment are shown in Figure 4.4.
Reaction times to monosyllabic items are quicker when the proposition is true than when it is false (F(1,39) = 46.950, p<0.05). Reaction times are not significantly affected by alignment although there is a trend for reactions to be quicker with correct alignment (F(1,39) = 0.285, p>0.05). Polysyllabic items are also responded to faster when the proposition is true than when it is false (F(1,39) = 42.255, p<0.01). Interestingly there is also a weakly significant trend for reaction times to be quicker with incorrect alignment (F(1,39) = 2.516, p=0.06) and it is this result that leads to the overall finding of quicker reaction times to true items with incorrect alignment.

4.3.2.2 Errors

Overall, the error rate is very low. Out of 3,200 responses there were only 60 errors, a rate of 1.9%. This is a common finding for sentence-picture verification tasks; for example, Clark and Chase (1972: 492) find error rates to be between 4.7% and 12.8% for positive sentences when the picture is presented first.

There were more errors to monosyllabic than polysyllabic items (F(1,39) = 5.151, p<0.05) as shown in the lower panel of Figure 4.4. There were more errors to false than true propositions (F(1,39) = 3.136, p<0.05) as shown in the lower panel of Figure 4.2. There was a weakly significant effect of alignment with more errors made to incorrect versions than correct versions (F(1,39) = 2.145, p=0.08) as shown in the lower panel of Figure 4.3.

In the same way as for reaction times, mono- and polysyllabic items can be analysed separately. For monosyllabic items, errors are not affected either by the truth-value of the proposition (F(1,39) = 0.188, p>0.05) or by the alignment of EP (F(1,39) = 0.526, p>0.05). Interestingly, more errors are made to polysyllabic items when the proposition is false (F(1,39) = 2.951, p<0.05) and when incorrect alignment is used (F(1,39) = 3.690, p<0.05).
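The by-subject analyses rest on the cell means described in section 4.3.1. The following is a minimal sketch of that preparation step, not the procedure actually run for this experiment: the trial record layout and function name are hypothetical, and the natural logarithm is an assumption, since the base of the log transform is not stated.

```python
import math
from collections import defaultdict

TIMEOUT_MS = 3000.0  # responses slower than three seconds count as non-responses


def log_rt_cell_means(trials):
    """Mean log reaction time per (subject, foot type, alignment, truth-value) cell.

    Each trial is a dict with the hypothetical keys 'subject', 'foot', 'alignment',
    'truth', 'rt_ms' (None when no response was given) and 'correct' (bool).
    Missing, too-slow and incorrect responses are excluded, so a cell mean is
    based only on the remaining data points.
    """
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [sum of log RTs, number of trials]
    for trial in trials:
        rt = trial["rt_ms"]
        if rt is None or rt > TIMEOUT_MS or not trial["correct"]:
            continue
        cell = (trial["subject"], trial["foot"], trial["alignment"], trial["truth"])
        totals[cell][0] += math.log(rt)  # natural log: an assumption
        totals[cell][1] += 1
    return {cell: total / count for cell, (total, count) in totals.items()}
```

With two foot types, two alignments and two truth-values, each subject contributes up to eight cell means, which correspond to the eight dependent variables entered into the repeated measures analysis.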
4.3.3 By item

A further analysis was conducted to examine the differences in reaction time and errors between correct and incorrect versions of each item. This analysis used paired t-tests with a single factor of alignment over items rather than over subjects.

Overall, reaction time is not significantly affected by the alignment (t(19) = 0.089, p>0.05). There is also no significant effect when monosyllabic (t(9) = 0.426, p>0.05) and polysyllabic items (t(9) = 0.532, p>0.05) are analysed separately.

Errors show a weakly significant effect overall: more errors are made when the incorrect alignment is used (t(19) = 1.629, p=0.06). Monosyllabic items show no significant effect of alignment on errors (t(9) = 0.410, p>0.05) but there are more errors to polysyllabic items synthesised with incorrect alignment (t(9) = 2.400, p<0.05). The number of errors made to each version of each item is shown in Table 4.6.

Monosyllabic                          Polysyllabic
Item     Correct    Incorrect         Item         Correct    Incorrect
Cat      0          0                 Catamaran    0          1
Guard    1          5                 Garden       0          3
Guide    2          2                 Guidedog     0          1
Key      5          1                 Keyhole      1          5
Pan      3          2                 Panther      0          1
Pea      0          1                 Peacock      1          0
Toad     0          3                 Toadstool    2          3
Train    1          0                 Trainer      1          1
Jug      2          1                 Juggler      1          1
Nail     3          5                 Nailfile     0          1
Total    17         20                Total        6          17

Table 4.6 Number of errors made in response to each target word

Figure 4.2 Response times (top panel) and number of errors (bottom panel) to true and false propositions, pooled over subjects

Figure 4.3 Reaction times (top panel) and errors (bottom panel) to items with correct and incorrect alignment, pooled over subjects

Figure 4.4 Reaction times (top panel) and errors (bottom panel) to mono- and polysyllabic items with correct and incorrect alignment, pooled over subjects

4.4 Discussion

The results have shown that there are interesting effects of alignment and also effects of foot type and truth-value. I will deal first with the final two of these results, which are essentially side issues, before returning to the main issue of alignment in section 4.4.3.

4.4.1 Foot type

Firstly, monosyllabic words are responded to more quickly than polysyllabic words. This is a typical result in experiments on spoken word recognition, at least when responses are measured from target word onset rather than offset. For example, Grosjean (1980) investigated the effects of word length based on the number of syllables in a word. A gating experiment was used in which successively longer portions of a word are presented and after each presentation subjects are asked to say what they think the word is and to indicate how confident they are about their answer. Results showed that polysyllabic words were identified at larger gate sizes (i.e. when more of the word had been presented) than monosyllabic words (although monosyllabic words were often not identified until after their acoustic offset).

In addition, Craig and Kim (1990) investigated the effects of word duration per se. Again using a gating procedure, they measured the average number of gates needed to identify words of different durations. They found that longer words are not identified until more of the word has been presented than is necessary to identify shorter words. Also, the proportion of the word presented (and not just the total duration) has to be greater in order for longer words to be identified. Shorter words, however, were often not identified with a high confidence level, even at the final gate.

As the present experiment also found that responses to monosyllables were quicker than those to polysyllables, this is an indication that the sentence-picture verification task used is sensitive.
Another factor, other than word length itself, that may affect reaction times to mono- and polysyllabic words is frequency of occurrence. Generally in English, polysyllabic words are of lower frequency of occurrence than monosyllabic words. A survey of the British National Corpus (http://sara.natcorp.ox.ac.uk) reveals that this is the case for all the pairs in this experiment except ‘guard’ and ‘garden’. These results, taken from unlemmatised counts of both written and spoken material, can be seen below in Table 4.7, where the number of occurrences per million words is shown for each item. It is very difficult to control for frequency effects in an experiment of this type where structural detail is so important and it would be almost impossible to find words of equal frequency that also meet the other requirements for items. Nevertheless, the different frequencies of occurrence for mono- and polysyllabic words probably contribute to the differences in recognition time.

Monosyllabic                       Polysyllabic
Item     Frequency per million     Item         Frequency per million
Cat      3801                      Catamaran    76
Guard    483                       Garden       9537
Guide    3844                      Guidedog     Unattested
Key      2375                      Keyhole      122
Pan      1183                      Panther      123
Pea      151                       Peacock      135
Toad     221                       Toadstool    26
Train    5929                      Trainer      971
Jug      547                       Juggler      24
Nail     494                       Nailfile     Unattested

Table 4.7 Frequencies for items in the experiment

4.4.2 True and false propositions

Reaction times to true propositions are quicker than those to false propositions. This is a very common finding in sentence-verification tasks. Clark and Chase (1972) demonstrate that in response to affirmative statements, like the ones used in this experiment, subjects are quicker to respond correctly to a true than to a false utterance. This extra ‘falsification time’ (Clark and Chase 1972: 483) is a robust finding in the literature.
Clark and Chase’s (1972) explanation for the effect is that subjects initially assume that all sentences are true and will only decide they are false when there is contradictory evidence. Thus, deciding an utterance is false adds an extra step to the process of verification, as the subject must change the value of a truth index from true to false. In turn this extra step adds a fixed amount of extra time to the subject’s reaction. Clark and Chase (1972) find falsification time to be 142 ms when pictures are presented before sentences. In the present experiment true utterances are verified in 1700 ms on average whereas false utterances are verified in 1792 ms, a difference of 92 ms. This is quite close to the Clark and Chase (1972) result and may be shorter because this task is simpler, requiring the evaluation of only one noun rather than the two in the earlier study (where the nouns were always ‘star’ and ‘plus’). There may also be an effect of sentence medium, as Clark and Chase (1972) used written sentences as opposed to the spoken sentences in this experiment. The presence of the falsification effect in this experiment again indicates that the task does work and is an effective way to measure processing.

4.4.3 Alignment

Although the effect of alignment is not significant overall, we have seen that an interesting effect emerges when the two foot types are separated. Reaction times to monosyllabic items are unaffected by alignment although there is a trend for responses to be quicker with correct alignment. The situation is reversed for polysyllabic words, however: there is a weakly significant trend for reaction times to be faster when polysyllabic words are synthesised with the incorrect alignment. This result is contrary to that predicted and seems at first sight to be incompatible with the original hypothesis that items will be processed more quickly with appropriate alignment.
However, if we think of the alignment differences in the stimuli in terms of ‘early’ and ‘late’ alignment rather than ‘correct’ and ‘incorrect’ the situation is clearer. For polysyllabic items, incorrect alignment means alignment earlier in the syllable than would occur naturally. Taking the pair ‘cat’ and ‘catamaran’ as an example, Table 4.2 shows that the duration of ‘cat’ is 574 ms. Therefore in correct versions alignment at 34% of the syllable puts EP at 195 ms whereas alignment in incorrect versions is later at 46% or 264 ms. In ‘catamaran’ the situation is reversed. The duration of the accented syllable is 287 ms and correct alignment puts EP at 46% or 132 ms. In incorrect versions however alignment is earlier at 34% or 98 ms. It is plausible therefore that it is this earlier EP rather than an appropriate EP alignment that speeds up reaction times. The exact mechanisms of how an early EP might contribute to the process of recognition are discussed below in section 4.4.4.3.

The effects of alignment on errors also differ between mono- and polysyllabic items. An exploration of the data for monosyllabic items alone reveals that the errors are somewhat random as they are unaffected either by the alignment of EP or by the truth-value of the proposition. This is an example of a speed-accuracy trade-off. Monosyllabic words, as we have seen, are responded to more quickly than polysyllabic items, but are correspondingly also more error prone. Comparing the open and closed circles in Figure 4.5 reveals that, for monosyllables, there is little difference in speed or number of errors when different alignments are presented. The story for polysyllabic items is rather different. In this case there are significantly more errors when alignment is incorrect than when it is correct as a comparison of the open and filled triangles in Figure 4.5 shows.
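The arithmetic behind the ‘cat’ and ‘catamaran’ example can be checked with a short sketch. The function and constant names are illustrative assumptions and form no part of the original analysis; the durations and percentages are those quoted in the text.

```python
def ep_time_ms(syllable_duration_ms: float, alignment: float) -> int:
    """Return the time of EP after syllable onset, rounded to the nearest ms.

    `alignment` is the EP position as a proportion of syllable duration (0 to 1).
    """
    return round(syllable_duration_ms * alignment)


# Values quoted in the text; the constant names are illustrative assumptions.
CAT_MS = 574             # duration of 'cat'
CATAMARAN_SYLL_MS = 287  # accented syllable of 'catamaran'
MONO_ALIGNMENT = 0.34    # alignment that is correct for the monosyllable
POLY_ALIGNMENT = 0.46    # alignment that is correct for the polysyllable

conditions = {
    ("cat", "correct"): ep_time_ms(CAT_MS, MONO_ALIGNMENT),
    ("cat", "incorrect"): ep_time_ms(CAT_MS, POLY_ALIGNMENT),
    ("catamaran", "correct"): ep_time_ms(CATAMARAN_SYLL_MS, POLY_ALIGNMENT),
    ("catamaran", "incorrect"): ep_time_ms(CATAMARAN_SYLL_MS, MONO_ALIGNMENT),
}

for (word, version), ep in conditions.items():
    print(f"{word:10s} {version:10s} EP at {ep} ms")
```

On these figures the incorrect version of ‘catamaran’ places EP 34 ms earlier than the correct version (98 ms against 132 ms), whereas the incorrect version of ‘cat’ places it 69 ms later (264 ms against 195 ms).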
We have already seen that responses to polysyllabic words are quicker when the incorrect intonation is used and have postulated that the earlier EP under these conditions may speed recognition. It is possible that this early EP may also lead subjects to make errors caused by the erroneous belief that enough of the word has been heard to make a correct response. However, this cannot always be the case. Although subjects do make errors to polysyllabic items when the wrong alignment is used, the majority of their responses are correct. This suggests that early alignment facilitates speed of recognition, usually allowing for a correct judgement to be made but on some occasions leading to errors.

Why would an early EP only sometimes lead to errors being made? Whether an error is made or not may depend to a large extent on the durations of the mono- and polysyllabic words in question and the alignment of the individual EPs. Imagine for example that there is a big difference between the durations of the target syllable in mono- and polysyllabic words. In the incorrect version of the polysyllabic word EP will occur early when subjects may have heard very little of the syllable. In this case they might have missed out on other important cues to word identity such as coarticulatory effects. When the syllables in mono- and polysyllabic words are not so different in duration however, the listener has indeed heard enough of the word and therefore an early EP allows for quick and accurate word recognition. Monosyllabic words are not slowed down by a late EP, however, because other cues have already allowed their recognition before EP is heard.
Figure 4.5 Speed-accuracy trade-off (mean number of errors and reaction time for each condition, pooled over subjects)

4.4.4 The role of prosody in spoken word recognition

It is important to discuss how these findings fit in with other research on spoken word recognition. In section 4.4.4.1 I will describe three models of spoken word recognition and discuss how these models do not include a prosodic component. Section 4.4.4.2 will consider previous experimental evidence concerning the role of prosody in word recognition whilst section 4.4.4.3 will consider how models might be modified to make use of prosody in the light of the findings presented in this chapter.

4.4.4.1 Models of spoken word recognition

Spoken word recognition is generally assumed to be accomplished by activation and competition of words in the mental lexicon. Models of spoken word recognition may be distinguished in various ways such as the number of stages involved, the use of top-down information and the methods used to identify word boundaries.

In the Cohort model (Marslen-Wilson and Welsh 1978), activation and competition are considered to be separate stages. The activation stage involves setting up a word-initial cohort based only on the first 150 ms of the word whilst the competition stage involves the elimination of any words in this cohort that do not match further incoming sensory information. The word that is recognised is the final word left in the cohort after the others have been eliminated. Newer versions of the model (e.g. Marslen-Wilson 1993) maintain the same principle but the elimination stage involves a gradual decay of activation for non-matching items rather than complete elimination and words are recognised when one reaches a particular threshold. In both older and newer versions of the model words are recognised sequentially so that the recognition of one word allows a new cohort to be set up for the next.
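The two stages of the early Cohort model can be illustrated with a minimal sketch. It rests on several simplifying assumptions that are not part of the model itself: a toy lexicon, ordinary spelling standing in for phonemic input, the first two letters standing in for the first 150 ms of the word, and all-or-nothing elimination rather than the gradual decay of later versions. All names are hypothetical.

```python
# Toy lexicon (an assumption for illustration only).
LEXICON = ["cat", "catamaran", "catalogue", "cap", "key", "keyhole"]


def initial_cohort(lexicon, onset):
    """Activation stage: every word that begins with the heard onset."""
    return {word for word in lexicon if word.startswith(onset)}


def recognise(lexicon, heard, onset_len=2):
    """Competition stage: eliminate candidates as further input arrives.

    Returns (word, segments_heard) as soon as a single candidate remains,
    or (None, len(heard)) if the input ends with several candidates left.
    """
    cohort = initial_cohort(lexicon, heard[:onset_len])
    for n in range(onset_len, len(heard) + 1):
        # Keep only the candidates that still match everything heard so far.
        cohort = {word for word in cohort if word.startswith(heard[:n])}
        if len(cohort) == 1:
            return cohort.pop(), n
    return None, len(heard)


print(recognise(LEXICON, "catamaran"))
print(recognise(LEXICON, "cat"))
```

With this toy lexicon ‘catamaran’ is the only candidate left after five segments, well before its offset, whereas ‘cat’ still shares its cohort with longer words when the input ends.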
The earlier version allows for the use of top-down information at all stages whilst the later versions restrict this use to a post-lexical stage of integration where the syntactic and semantic properties of the word are accessed.

Connectionist models such as TRACE (McClelland and Elman 1986) do not separate the activation and competition stages of the process. TRACE is a highly interactive connectionist network that makes heavy use of top-down information. TRACE consists of three levels of nodes corresponding to phonological features, phonemes and words. Activation of one node spreads to compatible nodes on other levels whilst inhibiting the activation of nodes on the same level, so for example, the activation of one node on the word level inhibits the activation of other word nodes but facilitates activation of compatible phonemes and features. A word is recognised when there is stable activation in the network and only one word node is activated. TRACE deals with the problem of identifying word boundaries by treating every part of the speech stream as a potential word onset and setting up a new network at every time-slice of the speech stream. In this way, not only do words with the same initial phonemes compete with each other (as in the Cohort model), but words beginning in different ways also compete.

The SHORTLIST model (Norris 1994) contains elements of both the Cohort model and TRACE. Activation and competition are separate stages, with activation being a bottom-up process, which creates a shortlist of candidates in some ways like the initial stage of the Cohort model. In this model, however, the shortlist contains candidates beginning at every phoneme in the input rather than only words matching the first 150 ms of input.
The competition stage is reminiscent of the lexical level of TRACE, whereby a spreading activation network allows the words in the shortlist to compete with each other using both bottom-up and top-down information. Word boundaries are detected in the same way as in TRACE (by treating every phoneme as a potential word onset) but, as activation and competition are separate, without the extra computation needed to create a new network at each time slice.

All three of these models of word recognition have little to say about the role of prosody. This apparent deficiency is due to many factors. Firstly, models of visual word recognition, in which prosody obviously does not play a part, have often influenced the development of models of spoken word recognition. Secondly, models of spoken word recognition often consider only the recognition of monosyllables, in which prosodic information is limited. Thirdly, the input to these models is often rather unrealistic, taking the form, for example, of a string of phonemes or phonological feature bundles. This also makes it difficult to allow for the inclusion of prosodic information. Finally, there is no real consensus on the role played by prosody in the recognition process. The next section will discuss some experimental findings that highlight this debate.

4.4.4.2 Experimental studies of prosody in spoken word recognition

The role suggested for prosody in spoken word recognition depends to some extent on the language in question and also upon the particular model of word recognition or theoretical framework under consideration. For example, one suggestion for the role of prosody arose to address the deficiencies of sequential recognition models such as the Cohort model (e.g. Marslen-Wilson and Welsh 1978). In the Cohort model words may, in theory, be recognised before their acoustic offset, allowing very fast recognition.
This is possible if a word’s uniqueness point (the point at which it becomes different from any other word in the lexicon) is before the acoustic offset, making it the only word remaining in the cohort. The evidence from distributional studies, however, suggests that this simple version of the model will not work. Luce (1986) identified the uniqueness point for each of 20,000 words and found that it did indeed occur before the offset of the word in 60% of cases. However, in English and other languages, short words are very often embedded in longer words as pointed out by, for example, McQueen and Cutler (1992). So, in the present experiment for example, ‘cat’ is embedded in ‘catamaran’, and so on. This embedding means that the uniqueness point for short words like ‘cat’ is often after their offset. Short words are also more frequent in the language than long words. When Luce (1986) took frequency into account he found that the probability of a word’s uniqueness point being before its offset was only 0.39.

It is in this respect that prosody has been considered to be important in the recognition process. If we consider that at the activation stage the listener is sensitive not only to the segmental phonology but also to the prosodic structure of the utterance then it is likely that there will be more success in identifying the correct word quickly. As the number of words in the lexicon matching both the segmental phonology and the prosody of the input will be small, there will be very few words considered at the competition stage, allowing the correct word to be isolated quickly. Work by Winfield et al. (1997) indicates that this explanation may be tenable. In a gating experiment, where successively larger portions of a word are played, subjects recognised words at gate sizes where the cohort size was still very large if only segmental phonology was considered.
When the cohorts were re-estimated to take prosody into account, however, they were actually much smaller and could potentially explain the subjects’ ability to recognise words so early. One difficulty in interpreting the result from this experiment however is that prosody is taken to be a whole collection of parameters including vowel quality, intonation, amplitude and duration.

Work by Cutler and colleagues (e.g. Cutler and Norris 1988) suggests that vowel quality information is crucially important in the recognition process. Cutler and Norris’ (1988) Metrical Segmentation Strategy suggests that vowel quality is used to explicitly segment the speech stream. This work is based on the observation that most words in English have a strong first syllable (where vowels are unreduced and the syllable can potentially carry stress). Therefore, strong syllables could be a good basis on which to initiate lexical access as the listener can assume that a strong syllable in the input is probably word-initial. This view has been largely supported by experimental work. Cutler and Butterfield (1992), for example, induced misperceptions by playing subjects unpredictable sentences at low intensities. They found that word boundary insertions in the misperceptions were more common before strong syllables whereas word boundary deletions were more common before weak syllables. For example the sentence “conDUCT asCENTS upHILL” (where capitals indicate a strong syllable) might be (mis)perceived as “the DOCtor SENDS her BILL” (Cutler and Butterfield 1992: 228). Spontaneous mishearings in which word boundaries are inserted before strong syllables, for example perceiving “a meCHANical HORSE” as “I’m a CANNibal HORSE” (author’s unpublished corpus), are common. Statistical analyses of such corpora (e.g. Cutler and Butterfield 1992) also support the Metrical Segmentation Strategy.
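The core idea of the Metrical Segmentation Strategy, that a word boundary is posited before every strong syllable, can be sketched as follows. The representation of the input as (syllable, is_strong) pairs and the function name are assumptions made purely for illustration; the sketch says nothing about how strength is detected in the signal.

```python
def segment_at_strong(syllables):
    """Group syllables into candidate words, opening a new candidate at
    every strong syllable (and at the very start of the input)."""
    chunks = []
    for text, is_strong in syllables:
        if is_strong or not chunks:
            chunks.append([text])      # posit a word boundary here
        else:
            chunks[-1].append(text)    # attach weak syllable to the current chunk
    return chunks


# "conDUCT asCENTS upHILL", with capitals marking strong syllables.
utterance = [
    ("con", False), ("DUCT", True),
    ("as", False), ("CENTS", True),
    ("up", False), ("HILL", True),
]

print(segment_at_strong(utterance))
```

For this input the posited boundaries fall before ‘DUCT’, ‘CENTS’ and ‘HILL’, the same positions at which the reported misperception begins the words ‘DOCtor’, ‘SENDS’ and ‘BILL’.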
Cutler (1986) goes further, by stating that it is only vowel quality and not other features of stress that affect word recognition in English. For example a cross-modal associate priming experiment demonstrates that presentation of either member of minimal stress pairs (such as ‘forBEAR’ and ‘FORbear’) primes words associated to both members. Cutler (1986) suggests that the suprasegmental differences between such pairs are not able to constrain lexical activation and as such these pairs are homophonous.

This under-use of suprasegmental cues for word recognition appears not to be found in other languages. Cutler and van Donselaar (2001) investigate whether pairs such as ‘VOORnam’ (first-name) and ‘voorNAM’ (respectable) are phonetically equivalent from the point of view of lexical access (or ‘homophonous’ in Cutler’s terms), in Dutch. In one experiment, an auditory lexical decision task is used to show that one member of this minimal pair does not facilitate responses to the other (unlike in English) although repetition priming (where one member of the pair facilitates recognition of itself) is reliably found. This suggests that suprasegmental cues are used in the recognition process in Dutch.

Why should it be the case that Dutch listeners use suprasegmental cues and English listeners do not? Dutch and English are very similar in that both are stress timed, have an opposition between strong and weak syllables and have lexical stress. Cutler and colleagues (e.g. Cutler et al. 1997) suggest that the different findings for different languages are related to differences in word prosody. In English, there is a close correspondence between vowel quality and stress, in that stressed syllables always contain full vowels whilst unstressed syllables usually contain reduced vowels. This is not the case in many other languages, such as Dutch, where it is common to have full vowels in unstressed syllables. Cutler et al.
(1997) suggest that in English, stress is unambiguously signalled by vowel quality, and therefore listeners are not advantaged by paying attention to other, suprasegmental, correlates of stress. In addition, information about word identity from segments may be available earlier than information from other sources. Cooper et al. (2002: 209) discuss how cues found within vowels already indicate the identity of the following consonant and can be used for word recognition.

The view that English listeners do not use suprasegmental information for word recognition, preferring to rely on the cues to stress provided by vowel quality, has been rather pervasive in the literature. Nevertheless, an experiment by Lindfield et al. (1999) does attempt to assess the contribution of suprasegmental aspects of stress to word recognition in English. Using a similar method to Winfield et al. (1997), three different gating conditions were compared. The first was a standard gating condition in which the onset of a word was presented in increasingly larger fragments. In the second condition, the remainder of the word after the gate was filled in with white noise, informing the subject about the duration of the word. In the third condition, the remainder of the word was filled in by band-pass noise, informing the subject about the duration of the word and also the number and stress of the syllables. Results showed that subjects recognised words at smaller gates in the third condition than in either of the other two conditions whilst there was no significant difference between recognition points in the first two conditions. The authors interpret this result as showing that prosody (and seemingly only syllabic stress) is used to constrain the word-initial cohort. Lindfield et al.’s (1999) result does not sit well with the more generally accepted view put forward by, for example, Cutler et al.
(1997) that suprasegmental correlates of stress are not useful to English listeners. Recently however, it has again been suggested that the role of these suprasegmental correlates in English may have been underestimated. Cooper et al. (2002) demonstrate that in a cross-modal priming task, English listeners’ responses are facilitated by primes that are stress matched. This is the case for both one- and two-syllable primes. Thus, ‘ADmi-’ primes ‘ADmiral’ more than it primes ‘admiRATion’ and ‘MUS-’ primes ‘MUSic’ more than it primes ‘muSEum’. Cooper et al.’s (2002) results are further evidence that English listeners can indeed exploit suprasegmental information in word recognition. There is still, however, evidence that English listeners find this task more difficult than, say, Dutch listeners. It seems that English listeners can use information from a two-syllable prime more effectively than from a one-syllable prime. In the one-syllable case there is also facilitation from a stress-mismatched (but segmentally identical) prime whilst this is not the case for the disyllabic primes (Cooper et al. 2002). Dutch listeners only show a priming effect for primes that match in both stress and segmental identity.

So far none of these results have much to say about the issue of intonation in word recognition other than the role it might play as part of a more general stress parameter. Recently however, event-related potentials (ERPs) have been used to try to illuminate this issue. Boecker et al. (1999) identified a negative-going ERP 325 ms after word onset (N325) for words that contain a reduced vowel in German. N325 is not found when only a difference in stress is present (Friedrich et al. 2000) suggesting that vowel quality and other elements of prosody do indeed play different roles in speech processing. Friedrich et al.
(2001) study the pitch contour as a single correlate of stress where a higher, rising F0 is associated with the stressed syllable. For German listeners they report a positive potential 200 ms after word onset (P2), which is larger when the first syllable of a disyllabic word is unstressed than when it is stressed. Friedrich et al. (2002) report that not only do stressed primes facilitate recognition of words with stressed first syllables (regardless of segmental identity) but that there is also a larger positive potential 350 ms after word onset (P350) when there is a stress mismatch. Results from these ERP studies are interesting, as they suggest that the pitch contour is extracted early on in the process of word recognition. Nevertheless, as we have seen, speakers of different languages may rely on different aspects of prosody to different extents, so it is not appropriate to extend these results from German to English listeners without further investigation.

4.4.4.3 A proposed role for prosody in models of spoken word recognition

It seems, given the experimental evidence discussed above and the results from the work presented in this chapter, that prosody does indeed play some role in spoken word recognition. This section will consider how prosody might affect recognition in relation to the three models described above and what exactly the role of EP alignment might be.

There are two obvious ways in which prosody might be used in spoken word recognition. In the first alternative prosody might operate at the activation stage of the process influencing which words are considered for recognition. This would have two implications for models. Firstly the input to models would have to include prosodic information rather than just a string of phonemes or features.
This is easily accomplished in the Cohort model where prosodic information as well as segmental information in the first 150 ms of input could be considered. In SHORTLIST and TRACE the problem is more difficult to solve, as the inputs to these models are phonemes in SHORTLIST and phonological features in TRACE. A general solution is for a more realistic input or for additional channels or layers of nodes dealing specifically with prosody. The second implication of this approach is that the mental representation of a word must also have access to prosodic information. This information may form part of the entry for the word in the mental lexicon or, more likely, may exist as a set of rules that allow predictable aspects of prosody (such as the alignment of EP) to be established from phonological structure. The prosodic information, in whichever form it takes, will allow the word to be activated by any congruent information in the input.

The second alternative is that prosody might be used at the competition stage. So, words might be initially activated on the basis of segmental phonology alone but then might receive more activation if they match the prosody of the input as well. McQueen et al. (1994: 235) suggest that the Metrical Segmentation Strategy could be incorporated into the SHORTLIST model in this way. The Metrical Segmentation Strategy would act as a bias in the competition process causing words with strong first syllables to receive more activation than those with weak first syllables. In this suggestion the input to models would not need to be enriched as words with strong first syllables receive extra activation regardless of the strength of syllables in the input. Lexical entries, however, need to be specified for syllable strength.
A different version of this suggestion might be similar to that given in the first alternative above, namely that matching prosody biases the competition process by virtue of both richer lexical entries (or sets of rules) and a richer input.

The results found for EP alignment, whilst suggesting that it is indeed used in spoken word recognition, do not seem to be accounted for by either of these two alternative explanations. It seems unlikely that EP alignment is used to constrain the items initially activated and considered for recognition. If this were the case, the number of items considered would potentially be very small. Depending on the fine-grainedness of the distinctions represented in the input and lexicon, there may be only one word activated and put forward to the ‘competition’ stage. This should mean that subjects make very fast decisions regardless of whether alignment is correct or incorrect as there are so few items left at the competition stage. It should also mean that subjects are always correct with correct alignment but always make errors with incorrect alignment. As we have seen however, this is not the case. Although subjects do make more errors when alignment is incorrect the number of errors is still very few. Conversely some errors are made when alignment is correct. These results suggest that alignment does not constrain the initial activation of items, which is not surprising given the rather variable alignment of EP between speakers. This theory would be particularly unworkable for the Cohort model as EP is not reliably found within the first 150 ms of the word, which is considered to be used as the basis for activation.

It is also unlikely that EP acts as a bias at the competition stage by virtue of being represented in lexical entries (or by rules that assign its alignment on the basis of structure).
If this hypothesis were correct, items with matching alignment would be activated more than items with non-matching alignment and responses to these more highly activated items would be quicker. As we have seen, however, there is no overall effect of alignment. Even when responses to mono- and polysyllabic items are separated there is no effect of alignment for monosyllables and for polysyllables responses are actually quicker with incorrect alignment. This suggests that EP in the input is not matched to EP in lexical entries at the competition stage.

A final alternative might be that the intonation component, studied here with reference to EP alignment, does not form part of the lexical entry for individual words, but instead is processed separately as a guide to the proximity of the end of the word or phrase. This process could work in two possible ways. In the first alternative, when a person hears EP, they work out how many syllables are left in the word and facilitate the activation of items with that number of syllables at the competition stage. The distinction is likely to be between mono- and polysyllabic words as these were found to be significantly different in speakers’ productions in Chapter 2. Alternatively they might use EP as a signal to immediately move on to the next stage in the process and choose to recognise the word with the highest activation at that point in time. In real conditions these strategies would generally be reliable but in this experiment they may occasionally lead the listener to make errors because, in fact, not enough of the word has been heard to make the correct decision.

4.4.4.4 Implications for future research

There are a number of issues to consider here. One is that EP alignment varies between speakers. If EP triggers immediate lexical access then this is unproblematic as the listener will respond in the same way regardless of the speaker’s particular alignment.
If, however, EP does signal how many syllables are left in the word the listener will need knowledge of any particular speaker’s alignment patterns before this information can be used reliably. This knowledge could take the form of statistical regularities derived from a small sample of speech or of a type of episodic memory.

Another issue is that the sentence-verification task used here may not be sensitive enough to have revealed everything about the role of EP in spoken word recognition. In particular, it tells us little about the time course of its usage. There are several more traditional methods that might help to distinguish when exactly EP is used. For example the gating and cross-modal priming tasks might be appropriate. Gating would provide information about the candidates generated before and after EP is heard and the recognition time in relation to EP. Cross-modal fragment priming would show whether relatives of both mono- and polysyllabic words are facilitated if a syllable is synthesised with EP aligned appropriately for either one or the other.

Another problem is that only EP alignment is manipulated in this experiment whereas in the natural data monosyllabic words and the segmentally identical initial syllables of polysyllabic words are also distinguished by factors such as duration (shorter in polysyllables) and the rate of change of the fall (generally shallower in polysyllables). The results gained by manipulating EP alone suggest that EP itself does play a role in word recognition but a full picture cannot emerge until the roles of these other factors have been identified. It is likely that the inclusion of additional distinguishing prosodic variables will further support the case for the role of detailed prosodic information in word recognition.

An additional problem is that the plateau has only been studied for nuclear falls in broad focus declaratives.
Therefore it is difficult to know what role alignment might play in prenuclear accents or when the speaker uses a different intonation pattern. It could in fact be the case that EP is a phrasal property marking the distance until the end of the utterance rather than the end of the word or foot; in this case it would only play a role in nuclear position. This proposal is unlikely however because post-nuclear tails can be very long.

A final problem is that SP was also manipulated in this experiment as it occurs at a fixed time before EP. Therefore we are not able to conclusively rule out SP as an important factor. However, the particular contrasts used are based on the observed alignment of EP and it is likely therefore that the results reflect this. To be certain, however, more experiments are needed where SP and EP (and probably the peak too) are independently manipulated.

4.5 Conclusions

It has been demonstrated that EP alignment does affect the recognition of words. Specifically, polysyllabic words are recognised more quickly but less accurately when they are resynthesised with an earlier, incorrect alignment. This is an interesting result as only EP was manipulated although a host of other variables differ in the detailed prosodic structure of mono- and polysyllabic words. It seems that EP does mark linguistic structure but probably not by virtue of forming part of the lexical entry for a word. More likely is that EP acts as a factor additional to the segmental phonology of the word either by triggering immediate lexical access or by allowing words with the correct number of syllables to be favoured at the competition stage.
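The two candidate mechanisms named in the conclusion can be contrasted in a toy sketch. Everything in it is a hypothetical illustration: the activation values, the size of the boost and the function names are assumptions, and the cue is passed in as a simple flag rather than derived from EP alignment.

```python
def favour_syllable_count(activations, syllable_counts, cue_polysyllabic, boost=0.2):
    """Mechanism 1: EP signals whether more syllables are to come, and
    candidates with the matching syllable count receive extra activation."""
    return {
        word: act + (boost if (syllable_counts[word] > 1) == cue_polysyllabic else 0.0)
        for word, act in activations.items()
    }


def trigger_on_ep(activations):
    """Mechanism 2: on hearing EP, select the most active candidate at once."""
    return max(activations, key=activations.get)


# Hypothetical state of the competition at the moment EP is heard.
activations = {"cat": 0.6, "catamaran": 0.5}
syllable_counts = {"cat": 1, "catamaran": 4}

boosted = favour_syllable_count(activations, syllable_counts, cue_polysyllabic=True)
print(max(boosted, key=boosted.get))  # candidate favoured under mechanism 1
print(trigger_on_ep(activations))     # candidate selected under mechanism 2
```

In this made-up state the immediate trigger selects the shorter word because too little of the input has been heard, which is the kind of occasional error the second mechanism would predict when EP arrives early.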