(Introduction to) Language History and Use: Psycholinguistics
Speech Perception and Spoken Word Recognition
Rachael-Anne Knight

I. Introduction

How do we recognise words when people speak to us? In this lecture we explore how we find the correct entry in our word store (the mental lexicon).

NB. Here we are dealing purely with accessing the entries for individual words, not with how we extract meaning from larger chunks of speech. That is covered in the lecture on Sentence and Discourse Comprehension.

II. Speech Perception

The first stage in the recognition process is the perception of speech.

A. Is Speech Special?

Humans seem to process speech sounds differently from other sounds (cf. sine-wave speech). We can understand speech at a rate of 20 phonemes per second, but can only identify sequences of non-speech sounds at rates slower than 1.5 sounds per second. Speech also holds an advantage over non-speech sounds when heard against background noise.

B. The Nature of Speech

There is no one-to-one mapping between the acoustic, physical properties of speech and the sounds we perceive. This is due to two related problems.

1. The invariance and segmentation problems

The same phoneme may be produced in different ways depending on the surrounding context, i.e. there is a lack of invariance. Phonemes take on some of the properties of the surrounding phonemes (assimilation) as the vocal tract begins to move to the position for the next sound (coarticulation). This also means that segmentation is impossible: sounds slur together, and the signal cannot be divided into discrete time-slices each representing a single phoneme.

a) Disadvantage of these properties
Sounds cannot be identified by matching them to a mental template.

b) Advantage of these properties
For speakers, speech can be faster, as each segment does not have to be produced separately. For listeners, information about each segment is spread over time, and each point in time may carry information about more than one segment.

C. Adult Word Segmentation

The segmentation problem does not just apply to phonemes. It would also be difficult to identify word boundaries based solely on phonological information. Look at the following pairs:

Ice cream / I scream
Nitrate / Night rate
Take Gray to London / Take Greater London

1. Strategies for determining the location of word boundaries

a) Stress patterns
The majority of English words have a strong-weak (trochaic) stress pattern, so a strategy that expects word boundaries before strong syllables could be efficient. One such strategy is the Metrical Segmentation Strategy. Cutler and Butterfield (1992) played faint utterances such as "conduct ascents uphill". Subjects reported hearing e.g. "The doctor sends the bill", suggesting that they had inserted word boundaries before the strong syllables.

b) Phonotactics
Different languages allow different sequences of phonemes. For example, /kn/ is not a legal onset in English but is in Dutch. English listeners hearing /kn/ can therefore segment the stream by assuming that there must be a word boundary between the two sounds (a toy sketch of this idea follows below).
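A minimal sketch of the phonotactic cue, assuming a letters-for-phonemes transcription and a simplified set of clusters that cannot begin an English word; none of this comes from a published model, it just makes the logic of the strategy explicit:

```python
# Clusters that cannot begin an English word (simplified; real phonotactics
# would also have to consider codas and word-internal syllable boundaries).
ILLEGAL_ONSETS = {"kn", "nd", "mb"}

def phonotactic_boundaries(phonemes):
    """Hypothesise a word boundary wherever two adjacent phonemes
    could not belong to the same onset."""
    boundaries = []
    for i in range(len(phonemes) - 1):
        if phonemes[i] + phonemes[i + 1] in ILLEGAL_ONSETS:
            boundaries.append(i + 1)  # boundary falls between the two phonemes
    return boundaries

# 'black night' -> "blaknait": /kn/ cannot begin a word, so a boundary
# is posited between /k/ and /n/.
print(phonotactic_boundaries("blaknait"))  # -> [4]
```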
c) Allophones
Within a language, a single phoneme may be produced in different ways depending on its position within a word (allophones). For example, /p/ is aspirated in 'pin' but unaspirated in 'spin'. Hearing an aspirated /p/ therefore suggests that it is word-initial, i.e. that there is a word boundary before it. Smith and Hawkins (2000) constructed nonsense words in which real words were embedded, such as 'puzoath'. Listeners could spot 'oath' faster if it had originally been produced with a word boundary before it, suggesting that listeners are sensitive to patterns of allophonic detail.

d) Knowledge of real words
Listeners do not like to leave sounds unattached to possible words. The possible-word constraint (Norris et al. 1997) predicts that "fill a green bucket" will be preferred to "filigree n bucket", because in the second parse the /n/ does not form part of any word (see the sketch below).

e) Summary
There are several different types of cue, each depending on the sound and prosodic patterns of a language. The implication is that speakers of different languages will employ different strategies for segmenting words.
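To make the possible-word constraint in (d) concrete, here is a toy rendering. The "a possible word must contain a vowel" test is a simplification of Norris et al.'s (1997) proposal, and the rough transcriptions are illustrative:

```python
VOWELS = set("aeiou")

def possible_word(chunk):
    """A stranded chunk is a possible word only if it contains a vowel."""
    return any(ph in VOWELS for ph in chunk)

def violates_pwc(parse):
    """True if any chunk in the parse could not be a word on its own."""
    return not all(possible_word(chunk) for chunk in parse)

# "fill a green bucket" vs "filigree n bucket": the lone /n/ violates
# the constraint, so the first parse is preferred.
print(violates_pwc(["fil", "a", "grin", "bucket"]))  # False - acceptable
print(violates_pwc(["filigri", "n", "bucket"]))      # True - /n/ is stranded
```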
D. Infant Speech Perception

What are the speech perception abilities of infants like, and how do these abilities develop over the first year of life? (Jusczyk (1997) provides a comprehensive review of the following issues.)

1. Native vs. non-native phonetic contrasts

a) Birth
Very young babies can discriminate both native and non-native phonetic contrasts.

b) 6-12 months
In the second six months of life, infants lose the ability to discriminate many of the contrasts that are not present in the ambient language.

2. Features used in adult word segmentation

a) Are infants sensitive to the features of language that adults use to segment speech?

(1) Stress
Infants prefer to listen to speech containing the dominant stress patterns of their own language, such as trochaic patterns for English. Nine-month-old American infants were played lists of English words that were either weak-strong or strong-weak. They listened longer to the lists of strong-weak words, even when the lists were low-pass filtered (removing phonotactic cues). Six-month-olds showed no preference for either list, suggesting that this sensitivity develops between 6 and 9 months of age.

(2) Phonotactics
Infants prefer to listen to speech with the phonotactics of their native language. Nine-month-olds were played lists of words either from their native language (English) or from another language (Dutch). The infants listened longer to the lists of words from their native language. There was no difference in listening time between the two lists when they were low-pass filtered, or when they were played to 6-month-olds. This suggests that somewhere between 6 and 9 months of age infants become sensitive to the phonotactics of their own language.

(3) Allophones
Infants can distinguish between pairs such as 'night rate' and 'nitrate' even when they are cross-spliced so that only allophonic, and not prosodic, differences remain.

(4) Summary
Between 6 and 9 months of age, infants develop sensitivity to the patterns of their native language that adults use for word segmentation.

b) Do infants use these strategies to segment speech?

(1) Stress
7.5-month-olds familiarised with isolated strong-weak words such as 'doctor' and 'candle' listen longer to sentences containing these words than to sentences containing controls, suggesting that they can segment the individual words. They cannot perform the same task when familiarised with weak-strong words.

(2) Phonotactics
10.5-month-olds can perform the above task with weak-strong words if there are good phonotactic cues.

(3) Allophones
10.5-month-olds familiarised with either 'night rate' or 'nitrate' listen significantly longer to sentences containing the familiar item than to ones containing the unfamiliar item, suggesting that they can segment the familiar item from fluent speech on the basis of allophonic cues. Nine-month-olds cannot perform this task.

(4) Summary
Infants begin to segment using something like the Metrical Segmentation Strategy, but then begin to use other cues (by 10.5 months of age), enabling them to segment words with weak-strong patterns. Jusczyk (1997) suggests that babies' speech perception abilities develop in order to facilitate the segmentation of individual words from the speech stream.

III. Spoken Word Recognition

A. Important Issues

What are the processes by which adults recognise spoken words? Are these processes autonomous or interactive (i.e. can processes see each other and feed back as well as forwards)? What use is made of context?

B. Experimental Methods

1. Lexical decision
An item is played to the subject, who must say whether it is a real word (film) or a nonword (flim). The time taken to make the decision is measured. Researchers investigate how the reaction time is affected by context, word frequency, etc.

2. Priming
Two words are played. The subject must say whether the second word is a real word or a nonword, and their reaction is timed. Researchers vary the first word to see how reaction time and errors are affected.

3. Gating
Increasingly long stretches of a word are played. The subject must identify the word as soon as they can. Researchers can measure the point at which different words are recognised and how this is affected by factors such as context.

4. Shadowing
Subjects must repeat back speech as they are hearing it (the delay is usually 150-250 ms). The input speech contains errors, such as missing or replaced phonemes. Researchers measure whether these errors are corrected in the subject's repetition (fluent restoration). The location of the errors and the type of speech (e.g. semantically unpredictable) are varied to see whether this affects the subject's repetition.
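As an illustration of how data from these paradigms might be summarised, here is a minimal sketch using invented lexical-decision reaction times; a real study would use many more trials and subjects and proper inferential statistics:

```python
# All numbers are hypothetical; they merely mirror the classic pattern
# described in the 'Fundamental Findings' section that follows.
from statistics import mean, stdev

rts_ms = {
    "high-frequency word (e.g. 'rain')":  [512, 498, 530, 505, 521],
    "low-frequency word (e.g. 'puddle')": [598, 610, 575, 620, 589],
    "nonword (e.g. 'flim')":              [655, 640, 670, 661, 648],
}

for condition, times in rts_ms.items():
    print(f"{condition}: mean = {mean(times):.0f} ms (sd = {stdev(times):.0f})")
```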
C. Fundamental Findings

There are some classical findings that have been replicated many times. Any model of Spoken Word Recognition must attempt to explain them.

1. Word frequency effects
Frequent words are recognised more quickly than infrequent words. In lexical decision tasks, frequent real words (e.g. 'rain') are recognised more quickly than infrequent real words (e.g. 'puddle').

2. Word supremacy effect
People are quicker to decide that a given target is a word than to recognise it as a nonword. In lexical decision tasks, people are quicker to identify 'film' than to reject 'flim'.

3. Context effects
Words in context are recognised more quickly than isolated words. In the gating task, isolated words are recognised in 333 ms (mean) whereas words in context are recognised in 199 ms (mean), e.g. 'camel' vs. 'at the zoo the children rode on a camel'. However, the situation is not a simple one: it is important to distinguish between different types of context (e.g. lexical, syntactic, semantic) and to consider how and when context has its effect. For example, can context affect which words are considered for recognition, or does it only help to eliminate candidates, or only to integrate a recognised word into the larger utterance? (See Harley for a review.)

4. Distortion effects
Words that are distorted at the beginning are recognised more slowly than words that are distorted at the end. In shadowing tasks, distortions are more likely to be restored if they occur in the third rather than the first syllable, e.g. 'tragety' vs. 'dragedy' (distortions of 'tragedy').

D. A word of warning!

The terms 'word recognition' and 'lexical access' are used differently by different researchers. In this lecture, 'word recognition' is the point where you know that a word exists in your lexicon and you know what the word itself is, but you don't know anything about it (i.e. whether it's a noun or a verb, or what it means). 'Lexical access' is the point where all the semantic and syntactic information about the word is available to you. Please bear in mind that this is not the case in every article or textbook. Sometimes the terms are used interchangeably, and sometimes the definitions given here are reversed. This is very confusing, so be careful, and make sure you know how you are using the terms.

E. Models of SWR

1. The Original Cohort Model

a) Summary
This model assumes that, on the basis of the first 250 ms of speech, a cohort of possible words is set up. As more speech input is heard, items from that cohort are eliminated. Word recognition occurs when only one item is left in the cohort.

Process to recognise the word 'slave':
a. The listener hears /sl/ and sets up an initial cohort (access stage): slow, slip, slack, slide, slave, sleigh
b.1. The listener hears /ei/ and eliminates non-matching candidates (selection stage): slave, sleigh
b.2. The listener hears /v/ and eliminates the final non-matching candidate: slave
c. The listener uses knowledge about syntax and semantics to integrate the word into the sentence (integration stage).
(A toy simulation of this elimination process appears after the evaluation below.)

b) How well does the Cohort Model account for the evidence?

Word frequency: The original model does not explain why frequent words are recognised quickly.

Word supremacy: You have to eliminate all the words in the cohort before you can decide that a target is not a real word, whereas real words can be recognised even before they are finished.

Context effects: Context can be used to help eliminate candidates from the cohort, so it will improve recognition speed. But context might be too powerful here.

Distortion effects: An initial distortion will be more problematic, as you will set up the wrong cohort. A later distortion may come after the recognition point, and other (contextual) cues can be used to help you. However, in a way this works too well: how can you ever recognise the word if you've set up the wrong cohort?
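Here is a minimal simulation of the selection stage from the 'slave' walkthrough, using spelling as a stand-in for phonemic transcription and the six-word toy lexicon from the example (a real lexicon would of course contain tens of thousands of entries):

```python
LEXICON = ["slow", "slip", "slack", "slide", "slave", "sleigh"]

def recognise(heard):
    """Eliminate cohort members as each successive segment arrives;
    recognition occurs when exactly one candidate is left."""
    cohort = list(LEXICON)
    for i in range(1, len(heard) + 1):
        prefix = heard[:i]
        cohort = [w for w in cohort if w.startswith(prefix)]
        print(f"after '{prefix}': {cohort}")
        if len(cohort) == 1:
            return cohort[0]          # recognition point reached
    return None                       # input exhausted without a unique match

recognise("slave")
# after 'sla':  ['slack', 'slave']
# after 'slav': ['slave']  -> recognised before the final segment arrives
```

Note how the word is recognised one segment before it ends, which is one way of seeing why real words can be accepted faster than nonwords can be rejected.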
2. The Revised Cohort Model

a) Modifications
The principle is the same as in the original model, in that listeners still set up an initial cohort of candidates. However, the elimination process is no longer all-or-nothing. Items that do not receive further positive information decay in activation rather than being eliminated, and a word is recognised when it has a higher relative activation than the other words in the cohort. This allows for backtracking if you mishear a word or it is distorted. Word frequency is built into this model by saying that frequent words may become activated more quickly than infrequent words. Context loses some of its power, as it cannot be used to influence the items that form the initial cohort.

3. The TRACE Model

The TRACE model is a connectionist network. Important features of such models are:
- There are lots of simple processing units
- Processing units are arranged into levels
- Units are joined together by weighted, bi-directional connections
- Input units are activated by incoming information
- Activation spreads along the connections to other units

a) Summary
The levels of processing units in TRACE are features, phonemes and words. Within a single level, the connections between units are inhibitory. Connections between consistent units on different levels are excitatory. Information flows in both directions, and top-down information (context) affects activation at lower levels.

[Figure 1: A simplified diagram of TRACE. Sensory input feeds feature nodes (+voi, -voi, -son, -nas), which connect to phoneme nodes (/p/, /b/, /k/), which in turn connect to word nodes (pan, ban, can, nan); context feeds into the word level.]

An example: recognising the word /pan/.
a. The listener hears input with the features -sonorant and -voice. These features become activated, and this activation inhibits other feature nodes.
b. The activation spreads to the next level and activates the phonemes with these features (/p/, /k/), which then exert an inhibitory influence on the surrounding phonemes. Activation also feeds back to the feature level to reinforce the activation of -sonorant and -voice.
c. The /p/ and /k/ phonemes activate the words /pan/ and /kan/ at the word level, which inhibit other word nodes. Activation also feeds back to activate /p/ and /k/, and in turn the relevant features.
d. All the time, contextual information helps to activate word nodes that are syntactically or semantically appropriate for the context.
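The walkthrough can be caricatured in a few lines of code. The sketch below is an interactive-activation toy in the spirit of TRACE, not the real implementation: the node sets are trimmed to the example (the word 'nan' is omitted because its onset /n/ is not among the toy phonemes), and the weights and update rule are invented:

```python
PH_FEATURES = {"p": {"-son", "-voi"},   # features supporting each phoneme
               "b": {"-son", "+voi"},
               "k": {"-son", "-voi"}}
WORD_ONSET = {"pan": "p", "ban": "b", "can": "k"}  # onset phoneme of each word

def run(input_features, cycles=5, excite=0.2, inhibit=0.1):
    phon = {p: 0.0 for p in PH_FEATURES}
    word = {w: 0.0 for w in WORD_ONSET}
    for _ in range(cycles):
        # bottom-up: input features excite consistent phoneme units
        for p, feats in PH_FEATURES.items():
            phon[p] += excite * len(feats & input_features)
        # phonemes excite matching words; words feed activation back down
        for w, p in WORD_ONSET.items():
            word[w] += excite * phon[p]
            phon[p] += excite * word[w]   # top-down reinforcement
        # within-level inhibition: every unit is pushed down by its rivals
        for level in (phon, word):
            snapshot = dict(level)
            for u in level:
                level[u] -= inhibit * (sum(snapshot.values()) - snapshot[u])
    return phon, word

phon, word = run({"-son", "-voi"})        # the input for /pan/ (or /kan/)
print(sorted(word.items(), key=lambda kv: -kv[1]))
# 'pan' and 'can' finish with the highest (equal) activation; context,
# not modelled here, would be needed to decide between them.
```

Even this caricature shows the model's signature behaviour: bottom-up evidence activates both the /p/ and /k/ words, feedback reinforces the supporting phonemes and features, and within-level inhibition suppresses the rivals.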
b) How well does TRACE account for the evidence?

Word frequency: The model does not currently explain why frequent words are recognised quickly, but modifying it in the same way as the Cohort model would help.

Word supremacy: Real words feed activation back down to the lower levels to reinforce earlier information; nonwords do not have this advantage. However, it is not clear exactly how the model would finally decide that a word does not exist in the inventory.

Context effects: Context feeds down to affect the perceptual level. But, again, context might be too powerful here.

Distortion effects: It is easy to recover from distortions, because the model is not so reliant on hearing everything perfectly. Word-initial sounds contribute more to the activation of word nodes than later sounds.

IV. Conclusions

Both of the models discussed can explain much of the experimental evidence concerning recognition. However, no single model seems able to explain all of the evidence, and many models make predictions that are not supported by the evidence. The role of context and the ability to segment the speech stream are still problematic for many models of Spoken Word Recognition.

V. Glossary

Strong syllable: a syllable that is stressed and does not have a reduced vowel
Weak syllable: a syllable that is not stressed and may contain a reduced vowel
Trochaic: a strong-weak stress pattern (e.g. 'doctor')
Phonotactics: sequential constraints that operate on adjacent phonetic segments
Low-pass filtering: removal of high frequencies (removes segmental identity but preserves prosodic characteristics)

VI. Reading and References

Reading
Harley, T. (2001) The Psychology of Language. Cambridge: CUP. Chapter 8.
Goldinger, S., Pisoni, D. and Luce, P. (1996) "Speech perception and spoken word recognition: research and theory". In N. Lass (ed.), Principles of Experimental Phonetics. St. Louis: Mosby. 277-327.
Harris, M. and Coltheart, M. (1989) Language Processing in Children and Adults. London: Routledge. 159-171. (Available on reserve in the MML library; ask at the front desk.)
Lively, S. and Goldinger, S. (1994) "Spoken word recognition: research and theory". In M. Gernsbacher (ed.), Handbook of Psycholinguistics. San Diego: Academic Press. 265-301.
Jusczyk, P. (1997) The Discovery of Spoken Language. Cambridge, MA: MIT Press. Chapters 2 and 3. (An incredibly comprehensive and easy-to-read survey of infant speech perception abilities.)
Grosjean, F. and Frauenfelder, U. (eds.) (1997) A Guide to Spoken Word Recognition Paradigms. Hove: Psychology Press.

References
Cutler, A. and Butterfield, S. (1992) "Rhythmic cues to speech segmentation: evidence from juncture misperception". Journal of Memory and Language, 31, 218-236.
Norris, D., McQueen, J., Cutler, A. and Butterfield, S. (1997) "The possible-word constraint in the segmentation of continuous speech". Cognitive Psychology, 34, 191-243.
Smith, R. and Hawkins, S. (2000) "Allophonic influences on word-spotting experiments". Proceedings of the ISCA Workshop on Spoken Word Access Processes, Nijmegen, The Netherlands, 139-142.