(Introduction to) Language History and Use: Psycholinguistics
Speech Perception and Spoken Word Recognition
Rachael-Anne Knight
I. Introduction
How do we recognise words when people speak to us? In this lecture we explore how we
find the correct entry in our word store (mental lexicon). NB. Here we are dealing
purely with accessing the entries for individual words. We are not dealing with how we
extract the meaning from larger chunks of speech. This will be covered in the lecture on
Sentence and Discourse comprehension.
II. Speech Perception
The first stage in the recognition process is the perception of speech.
A. Is Speech Special?
Humans seem to process speech sounds in different ways from other sounds (cf. sine-wave
speech). We can understand speech at a rate of 20 phonemes per second, but can only
identify sequences of non-speech sounds at rates below 1.5 sounds per second. Speech is
also at an advantage over non-speech sounds when heard against background noise.
B. The Nature of speech
There is no one-to-one mapping between the acoustic, physical properties of speech and
the sounds we perceive. This is due to two related problems.
1. The Invariance and Segmentation problems
The same phoneme may be produced in different ways depending on the surrounding
context, i.e. there is a lack of invariance. Phonemes take on some of the properties of
neighbouring phonemes (assimilation) as the vocal tract begins to move to the position
for the next sound (coarticulation). This also means that
segmentation is impossible. Sounds slur together and the signal cannot be divided into
discrete time-slices each representing a single phoneme.
a) Disadvantage of these properties
Sounds cannot be identified by matching them to a mental template.
b) Advantage of these properties
For speakers, speech can be faster as each segment does not have to be produced
separately. For listeners, information about each segment is spread over time and each
point in time may carry information about more than one segment.
C. Adult Word Segmentation
The segmentation problem does not just apply to phonemes. It would also be difficult to
identify word boundaries based solely on phonological information. Look at the
following pairs:
Ice cream
I scream
Nitrate
Night rate
Take Gray to London
Take Greater London
(Introduction to) Language History and Use: Psycholinguistics
Speech Perception and Spoken Word Recognition
Rachael-Anne Knight
-2-
1. Strategies for determining the location of word boundaries.
a) Stress Patterns
The majority of English words have a strong-weak stress pattern (trochaic). Therefore a
strategy based on expecting word boundaries before strong syllables could be efficient.
One such strategy is the Metrical Segmentation Strategy. Cutler and Butterfield (1992)
played faint utterances such as “conduct ascents uphill”. Subjects reported hearing e.g.
“The doctor sends the bill,” suggesting they had inserted word boundaries before the
strong syllables.
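As a rough sketch of how such a strategy might be implemented (a toy illustration only,
not a published implementation; the (syllable, stress) representation is an assumption
made for the example), the following Python fragment posits a word boundary before
every strong syllable:

# A minimal sketch of the Metrical Segmentation Strategy: posit a word
# boundary before every strong (S) syllable; weak (W) syllables attach
# to the preceding word.
def metrical_segmentation(syllables):
    words, current = [], []
    for text, stress in syllables:
        if stress == "S" and current:
            words.append(current)  # start a new word at a strong syllable
            current = []
        current.append(text)
    if current:
        words.append(current)
    return words

# The faint utterance could come out segmented like Cutler and
# Butterfield's misperception "the DOCtor SENDS the BILL":
heard = [("the", "W"), ("doc", "S"), ("tor", "W"),
         ("sends", "S"), ("the", "W"), ("bill", "S")]
print(metrical_segmentation(heard))
# [['the'], ['doc', 'tor'], ['sends', 'the'], ['bill']]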
b) Phonotactics
Different languages allow different patterns of phonemes to occur. For example /kn/ is
not a legal onset in English but is in Dutch. So English listeners hearing /kn/ can
segment words by assuming that there must be a word boundary between the two sounds.
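A toy version of this idea is sketched below. The phoneme symbols and the tiny
constraint set are assumptions for illustration; note that /kn/ can in fact occur
word-internally across a syllable boundary (e.g. 'acknowledge'), so real phonotactic
cues are probabilistic rather than absolute.

# A toy sketch of phonotactic segmentation: posit a boundary wherever an
# adjacent phoneme pair is treated as impossible inside a word.
ILLEGAL_WITHIN_WORD = {("k", "n")}

def phonotactic_boundaries(phonemes):
    return [i + 1                      # boundary falls between i and i+1
            for i, pair in enumerate(zip(phonemes, phonemes[1:]))
            if pair in ILLEGAL_WITHIN_WORD]

# 'black night': the /k/+/n/ sequence forces a boundary between them.
print(phonotactic_boundaries(["b", "l", "a", "k", "n", "aI", "t"]))  # [4]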
c) Allophones
Within a language a single phoneme may be produced in different ways depending on its
position within a word (allophones). E.g. /p/ is aspirated in ‘pin’ but is unaspirated in
‘spin’. Therefore hearing an aspirated /p/ suggests that it is word-initial, i.e. that there is a
word boundary before it. Smith and Hawkins (2000) constructed nonsense words in
which real words were embedded, such as ‘puzoath’. Listeners could spot ‘oath’ faster if
it had originally been produced with a word boundary before it, suggesting that listeners
are sensitive to patterns of allophonic detail.
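The same boundary-detection logic can be sketched for allophonic cues, under the
simplifying assumption that aspirated voiceless stops occur only word-initially (in
reality aspiration marks stressed-syllable onsets more generally); the (phoneme,
aspirated) token notation is invented for the example.

# A sketch of allophone-based segmentation: posit a boundary before
# every aspirated voiceless stop.
ASPIRATABLE = {"p", "t", "k"}

def allophonic_boundaries(tokens):
    return [i for i, (phoneme, aspirated) in enumerate(tokens)
            if phoneme in ASPIRATABLE and aspirated]

# Unaspirated /p/ in 'spin' signals no boundary; aspirated /p/ in
# 'a pin' signals a word boundary before it.
print(allophonic_boundaries([("s", False), ("p", False), ("i", False), ("n", False)]))  # []
print(allophonic_boundaries([("a", False), ("p", True), ("i", False), ("n", False)]))   # [1]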
d) Knowledge of real words
Listeners do not like to leave sounds unattached to possible words. The possible-word
constraint (Norris et al. 1997) suggests that “fill a green bucket” will be preferred to
“filigree n bucket” because in the second parse the /n/ does not form part of a possible word.
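The intuition can be sketched in code as follows. The pseudo-phonemic lexicon, vowel
set and scoring rule are all assumptions made for illustration; the actual model of
Norris et al. penalises the activation of candidates rather than ruling parses out
absolutely.

# A sketch of the possible-word constraint: reject any parse that
# strands a residue (such as a lone consonant) that could not itself be
# a word, and prefer the parse containing the most real words.
LEXICON = {("f","I","l"): "fill", ("@",): "a", ("g","r","i","n"): "green",
           ("b","V","k","I","t"): "bucket",
           ("f","I","l","@","g","r","i"): "filigree"}
VOWELS = {"I", "@", "i", "V"}

def parses(phonemes):
    """All segmentations whose every chunk is a real word or at least a
    possible word (i.e. contains a vowel)."""
    if not phonemes:
        yield []
        return
    for i in range(1, len(phonemes) + 1):
        chunk = tuple(phonemes[:i])
        if chunk in LEXICON or any(p in VOWELS for p in chunk):
            for rest in parses(phonemes[i:]):
                yield [chunk] + rest

utterance = ["f","I","l","@","g","r","i","n","b","V","k","I","t"]
best = max(parses(utterance), key=lambda p: sum(c in LEXICON for c in p))
print([LEXICON.get(c, c) for c in best])
# ['fill', 'a', 'green', 'bucket'] -- a 'filigree' parse loses because
# its leftover /n/ is not a possible word.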
e) Summary
There are several different types of cue, depending on the sound and prosodic patterns of a
language. The implication is that speakers of different languages will employ different
strategies for segmenting words.
D. Infant Speech Perception
What are the speech perception abilities of infants like? How do these abilities develop
over the first year of life? (Jusczyk (1997) provides a comprehensive review of the
following issues).
1. Native vs. non-native phonetic contrasts
a) Birth
Very young babies can discriminate both native and non-native phonetic contrasts.
b) 6-12 months
In the second six months of life infants lose the ability to discriminate many of the
contrasts that are not present in the ambient language.
2. Features used in adult word segmentation
a) Are infants sensitive to the features of language that
adults use to segment speech?
(1) Stress
Infants prefer to listen to speech containing the dominant stress patterns of their own
language, such as trochaic patterns for English. 9-month-old American infants were
played lists of English words that were either weak-strong or strong-weak. They listened
longer to the lists of strong-weak words, even when the lists were low-pass filtered
(removing phonotactic cues). 6-month-olds showed no preference for either list, suggesting that this
sensitivity develops between 6 and 9 months of age.
(2) Phonotactics
Infants prefer to listen to speech with the phonotactics of their native language.
9-month-olds were played lists of words either from their native language (English) or
another language (Dutch). The infants listened longer to the lists of words from their
native language. There was no difference in listening time between the two lists when
they were low-pass filtered or when they were played to 6-month-olds. This suggests
that somewhere between 6 and 9 months of age infants become sensitive to the phonotactics of
their own language.
(3) Allophones
Infants can distinguish between pairs such as ‘night rate’ and ‘nitrate’ even when they are
cross-spliced so that only allophonic and not prosodic differences remain.
(4) Summary
Between 6 and 9 months of age infants develop sensitivity to the patterns of their native
language that adults use for word segmentation.
b) Do infants use these strategies to segment speech?
(1) Stress
7.5-month-olds familiarised with isolated strong-weak words such as ‘doctor’ and
‘candle’ listen longer to sentences containing these words than to sentences containing
controls, suggesting that they can segment the individual words. They cannot perform the
same task when familiarised with weak-strong words.
(2) Phonotactics
10.5-month-olds can perform the above task with weak-strong words if there are good
phonotactic cues.
(3) Allophones
10.5-month-olds familiarised with either ‘night rate’ or ‘nitrate’ listen significantly longer
to sentences containing the familiar item than to ones containing the unfamiliar item,
suggesting that they can segment the familiar item from fluent speech on the basis of
allophonic cues. 9-month-olds cannot perform this task.
(4) Summary
Infants begin to segment using something like the Metrical Segmentation Strategy but
then begin to use other cues (by 10.5 months old) enabling them to segment words with
weak-strong patterns. Jusczyk (1997) suggests that babies’ speech perception abilities
develop in order to facilitate the segmentation of individual words from the speech
stream.
III. Spoken Word Recognition
A. Important Issues
 What are the processes by which adults recognise spoken words?
 Are these processes autonomous or interactive (i.e. can information feed back as well
as forwards between processes)?
 What use is made of context?
B. Experimental Methods
1. Lexical Decision
An item is played to the subject, who must say whether it is a real word (film) or a
non-word (flim). The time taken to make the decision is measured. Researchers investigate
how the reaction time is affected by context, word frequency, etc.
2. Priming
Two words are played. The subject must say if the second word is a real word or a
non-word, and their reaction is timed. Researchers vary the first word to see how reaction
time and errors are affected.
3. Gating
Increasingly longer stretches of a word are played. The subject must identify the word as
soon as they can. Researchers can measure the point at which different words are
recognised and how this is affected by factors such as context.
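Gating results are often discussed in terms of a word's uniqueness point: the point at
which the input is consistent with only one lexical candidate. A minimal sketch of that
computation is given below, using spelling as a stand-in for phonemes, with a toy
lexicon assumed for the example.

# Find the position at which a word diverges from all other words in
# the lexicon, i.e. the earliest point at which a gating subject could
# in principle identify it.
LEXICON = ["slow", "slip", "slack", "slide", "slave", "sleigh"]

def uniqueness_point(word, lexicon):
    for n in range(1, len(word) + 1):
        candidates = [w for w in lexicon if w.startswith(word[:n])]
        if candidates == [word]:
            return n
    return None  # never unique (e.g. the word is a prefix of another)

print(uniqueness_point("slave", LEXICON))  # 4: 'slav' excludes all others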
4. Shadowing
Subjects must repeat back speech as they are hearing it (the delay is usually 150-250 ms).
The input speech contains errors such as missing or replaced phonemes. Researchers
measure whether these errors are corrected in the subject’s repetition (fluent restoration).
The location of the errors and the type of speech (e.g. semantically unpredictable) are
varied to see if this affects the subject’s repetition.
C. Fundamental findings
There are some classical findings that have been replicated many times. Any model of
Spoken Word Recognition must attempt to explain these findings.
1. Word Frequency Effects
Words that are frequent are recognised more quickly than words that are
infrequent
In lexical decision tasks, frequent real words are recognised more quickly than infrequent
real words, e.g. ‘rain’ vs. ‘puddle’.
2. Word Supremacy Effect
People are quicker to decide that a given target is a word than to recognise it as a
non-word.
In lexical decision tasks people are quicker to accept ‘film’ than to reject ‘flim’.
3. Context Effects
Words in context are recognised more quickly than isolated words.
In the gating task, isolated words are recognised in 333ms (mean) whereas words in
context are recognised in 199ms (mean). E.g. ‘camel’ vs. ‘at the zoo the children rode on
a camel’.
However, the situation is not a simple one and it is important to distinguish between
different types of context (e.g. lexical, syntactic, semantic) and also to consider how and
when context has its effect. For example, can context affect which words are considered
for recognition, or does it only help to eliminate candidates, or to integrate a recognised
word into the larger utterance? (See Harley for a review.)
4. Distortion Effects
Words that are distorted at the beginning are recognised more slowly than words
that are distorted at the end.
In shadowing tasks, distortions are more likely to be restored if they occur in the 3rd
rather than the 1st syllable, e.g. ‘tragety’ vs. ‘dragedy’ (both distortions of ‘tragedy’).
D. A word of warning!
The terms 'word recognition' and 'lexical access' are used differently by different researchers.
In this lecture 'word recognition' is the point where you know a word exists in your
lexicon and you know what the word itself is, but do not yet know anything about it (e.g.
whether it’s a noun or a verb, or what it means). 'Lexical access' is the point where all the
semantic and syntactic information about the word is available to you. Please bear in
mind that this is not the case in every article or textbook. Sometimes the terms are used
interchangeably, and sometimes the definitions I give here are reversed. This is very
confusing, so be careful and make sure you know how you are using the terms.
E. Models of SWR
1. The Original Cohort Model
a) Summary
This model assumes that on the basis of the first 250 ms of speech a cohort of possible
words is set up. As more speech input is heard items from that cohort are eliminated.
Word Recognition occurs when only one item is left in the cohort.
Process to recognise the word 'slave'
a. Listener hears /sl/ and sets up an initial cohort (access stage)
slow
slip
slack
slide
slave
sleigh
b.1 Listener hears /ei/ and eliminates non-matching candidates (selection stage)
slave
sleigh
b.2. Listener hears /v/ and eliminates final non-matching candidate
slave
c. Listener uses knowledge about syntax and semantics to integrate the word into the
sentence (integration stage)
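The access and selection stages can be sketched in a few lines of Python. This is a toy
illustration of the all-or-nothing elimination, with spelling standing in for the phonemic
transcription used above.

# A sketch of the original Cohort Model: set up a cohort from the
# word-initial input (access), eliminate candidates outright as soon as
# they mismatch (selection), and recognise when one candidate remains.
LEXICON = ["slow", "slip", "slack", "slide", "slave", "sleigh"]

def cohort_recognise(segments, lexicon):
    heard, cohort = "", list(lexicon)
    for seg in segments:
        heard += seg
        cohort = [w for w in cohort if w.startswith(heard)]  # all-or-nothing
        print(f"after '{heard}': {cohort}")
        if len(cohort) == 1:
            return cohort[0]  # recognised, possibly before the word ends
    return None

cohort_recognise(["sl", "a", "v", "e"], LEXICON)
# after 'sl': ['slow', 'slip', 'slack', 'slide', 'slave', 'sleigh']
# after 'sla': ['slack', 'slave']
# after 'slav': ['slave']   <- recognition point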
b) How well does the Cohort Model account for the
evidence?
Word Frequency
The original model does not explain why frequent words are recognised quickly.
Word Supremacy
You have to eliminate all the words in the cohort before you can decide a target is not a
real word whereas real words can even be recognised before they are finished.
Context Effects
Context can be used to help to eliminate candidates from the cohort so will improve
recognition speed.
But context might be too powerful here.
Distortion Effects
An initial distortion will be more problematic as you will set up your cohort wrongly.
A later distortion may come after the recognition point and other (contextual) cues can be
used to help you.
However in a way this works too well. How can you ever recognise the word if you've
set up the wrong cohort?
2. The Revised Cohort Model
a) Modifications
The principle is the same as in the original model in that listeners still set up an initial
cohort of candidates. However, the elimination process is no longer all-or-nothing.
Items that do not receive further positive information decay in activation rather than
being eliminated and a word is recognised when it has a higher relative activation than
other words in the cohort. This allows for backtracking if you mishear a word or it is
distorted. Word frequency is built into this model by saying that frequent words may
become activated more quickly than infrequent words. Context loses some of its power,
as it cannot be used to influence the items that form the initial cohort.
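These modifications can be sketched as follows. The activation values, decay rate,
evidence boost and recognition threshold below are invented for illustration and do not
come from the model's authors.

import math

# A sketch of the revised Cohort Model: mismatching candidates decay
# rather than being eliminated, frequent words get a head start, and a
# word is recognised when its activation sufficiently exceeds that of
# its nearest competitor.
LEXICON = {"slow": 60, "slip": 40, "slack": 20,
           "slide": 50, "slave": 30, "sleigh": 5}   # toy frequencies

def revised_cohort(segments, lexicon, boost=1.0, decay=0.5, threshold=2.0):
    activation = {w: math.log(f) for w, f in lexicon.items()}  # frequency head start
    heard = ""
    for seg in segments:
        heard += seg
        for w in activation:
            if w.startswith(heard):
                activation[w] += boost   # positive evidence raises activation
            else:
                activation[w] *= decay   # mismatches decay, allowing backtracking
        ranked = sorted(activation, key=activation.get, reverse=True)
        if activation[ranked[0]] - activation[ranked[1]] > threshold:
            return ranked[0]             # recognition is relative, not absolute
    return None

print(revised_cohort(["sl", "a", "v", "e"], LEXICON))  # slave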
3. The TRACE Model
The TRACE model is a connectionist network. Important features of these models are:
 There are lots of simple processing units
 Processing units are arranged into levels
 Units are joined together by weighted, bi-directional connections
 Input units are activated by incoming information
 Activation spreads along the connections to other units
a) Summary
The levels of processing units in TRACE are features, phonemes and words. Within a
single level the connections between units are inhibitory. Connections between
consistent units on different levels are excitatory. Information flows in both directions
and top-down information (context) affects activation at lower levels.
Figure 1: A simplified diagram of TRACE. Context feeds down from the top; sensory
input feeds up from the bottom. The word level contains nodes such as ‘pan’, ‘ban’,
‘can’ and ‘nan’; the phoneme level contains nodes such as /p/, /b/ and /k/; the feature
level contains nodes such as +voi, -voi, -son and -nas.
An example: recognising the word /pan/
a. Listener hears input with the features -sonorant and -voice. These features become
activated. This activation inhibits other feature nodes
b. The activation spreads to the next level and activates the phonemes with these
features (/p/, /k/), which then exert an inhibitory influence on the surrounding
phonemes. Activation also feeds back to the feature level to reinforce the activation
of -sonorant and -voice
c. The /p/ and /k/ phonemes activate the words /pan/ and /kan/ in the word level which
inhibit other word nodes. Activation also feeds back to activate /p/ and /k/ and in turn
the relevant features.
d. All the time, contextual information is being used, which helps to activate word nodes
that are syntactically or semantically appropriate for the context
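A drastically simplified interactive-activation sketch in the spirit of TRACE is given
below, with just a feature level and a word level. The units, weights and update rule are
all invented for illustration; the real model also has a phoneme level and processes
input over time.

# Between-level connections are excitatory (features excite consistent
# words, and words feed activation back to their features); within-level
# connections are inhibitory (words compete).
WORDS = {"pan": {"-son", "-voi"}, "ban": {"-son", "+voi"},
         "can": {"-son", "-voi"}, "nan": {"+nas"}}
FEATURES = ["-son", "-voi", "+voi", "+nas"]

def step(feat_act, word_act, input_feats, top_down=0.5, inhibit=0.3):
    new_word = {}
    for w, feats in WORDS.items():
        excite = sum(feat_act[f] for f in feats)             # bottom-up
        others = sum(a for x, a in word_act.items() if x != w)
        new_word[w] = max(0.0, word_act[w] + excite - inhibit * others)
    new_feat = {}
    for f in FEATURES:
        feedback = top_down * sum(a for w, a in word_act.items()
                                  if f in WORDS[w])           # top-down
        new_feat[f] = max(0.0, feat_act[f]
                          + (1.0 if f in input_feats else 0.0) + feedback)
    return new_feat, new_word

feat = {f: 0.0 for f in FEATURES}
words = {w: 0.0 for w in WORDS}
for _ in range(3):                       # a few cycles of spreading activation
    feat, words = step(feat, words, {"-son", "-voi"})
print(sorted(words.items(), key=lambda kv: -kv[1]))
# 'pan' and 'can' dominate; 'ban' and 'nan' are suppressed. Context (not
# modelled here) would feed extra activation into appropriate word nodes.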
b) How well does TRACE account for the evidence?
Word Frequency
The model does not currently explain why frequent words are recognised quickly but
modifying it in the same way as the Cohort model would help.
Word Supremacy
Real words feed back down to the lower levels to help reinforce earlier information.
Nonwords do not have this advantage. However it is not clear exactly how the model
would finally make a decision that a word didn't exist in the inventory.
Context Effects
Context feeds down to affect the perceptual level.
But, again, context might be too powerful here.
Distortion Effects
It’s easy to recover from distortions because you are not so reliant on hearing
everything perfectly. However, word-initial sounds contribute more to the activation of
word nodes than later sounds, so early distortions are still more disruptive.
IV. Conclusions
Both of the models discussed can explain much of the experimental evidence concerning
recognition. However, no single model seems to be able to explain all of the evidence
and many models make predictions that are not supported by the evidence. The role of
context and the ability to segment the speech stream are still problematic for many
models of Spoken Word Recognition.
V. Glossary
Strong syllable
A syllable that is stressed and does not have a reduced vowel
Weak syllable
A syllable that is not stressed and may contain a reduced vowel
Trochaic
A strong-weak stress pattern (e.g. doctor)
Phonotactics
Sequential constraints that operate on adjacent phonetic segments
Low-pass filtering
Removal of high frequencies (removes segmental identity but preserves prosodic
characteristics)
VI. Reading and References
Reading
Harley, T. (2001) The Psychology of Language, Cambridge: CUP, Chapter 8.
Goldinger, S., Pisoni, D. and Luce, P. (1996) “Speech perception and spoken word
recognition: research and theory.” In N. Lass (ed.), Principles of Experimental
Phonetics, St. Louis: Mosby, 277-327.
Harris, M. and Coltheart, M. (1989) Language Processing in Children and Adults,
London: Routledge, 159-171 (available on reserve in the MML library – ask at the front
desk)
Lively, S. and Goldinger, S. (1994) “Spoken word recognition: research and theory.”
In M. Gernsbacher (ed.), Handbook of Psycholinguistics, San Diego: Academic Press,
265-301.
Jusczyk, P. (1997) The Discovery of Spoken Language, Cambridge, Massachusetts: MIT
Press, Chapters 2 and 3 (for an incredibly comprehensive and easy to read survey of
infant speech perception abilities)
Grosjean, F. and Frauenfelder, U. (eds.) (1997) A Guide to Spoken Word Recognition
Paradigms, Hove: Psychology Press
References
Cutler, A. and Butterfield, S. (1992) “Rhythmic cues to speech segmentation: Evidence
from juncture misperception.” Journal of Memory and Language, 31, 218-236
Norris, D., McQueen, J., Cutler, A., and Butterfield, S. (1997) “The possible-word
constraint in the segmentation of continuous speech.” Cognitive Psychology, 34, 191-243
Smith, R. and Hawkins S. (2000) “Allophonic influences on word spotting experiments.”
Proceedings of the ISCA Workshop on Spoken Word Access Processes, Nijmegen, The
Netherlands, 139-142