Speech dynamics

The Main Idea: At an abstract linguistic level, phonetic
segments ([b], [p], [r], [k], [i], [ɑ], [u], etc.) are discrete,
independent, interchangeable snap-together parts – like
beads on a string.
Artist’s rendition of beads.
The word [kæt] is built by stringing together 3
distinct, discrete “beads” – [k], [æ], and [t].
The [æ] bead is snapped out, an [ɪ] bead is
snapped in, changing [kæt] into [kɪt].
The [t] bead is snapped out, an [n] bead
is snapped in, changing [kɪt] into [kɪn].
The [ɪ] bead is snapped out, an [æ] bead
is snapped in, changing [kɪn] into [kæn].
The [k] bead is snapped out, an [m] bead
is snapped in, changing [kæn] into [mæn].
The phonetic structure of speech is an example
of a discrete combinatorial system: morphemes
and words are built by combining separate,
independent, snap-together parts called
phonemes or phonetic segments.
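The bead substitutions above can be sketched in a few lines of Python. This is only an illustration of the "snap out, snap in" idea: the function name `snap` is made up for this sketch, and it assumes each segment is written with a single character.

```python
def snap(word, position, new_segment):
    """Return a copy of `word` with the bead at `position` swapped out."""
    segments = list(word)              # treat each character as one "bead"
    segments[position] = new_segment   # snap the old bead out, the new one in
    return "".join(segments)

# The chain of substitutions from the slides: (position, new bead)
steps = [(1, "ɪ"), (2, "n"), (1, "æ"), (0, "m")]

word = "kæt"
history = [word]
for position, segment in steps:
    word = snap(word, position, segment)
    history.append(word)

print(" -> ".join(history))
```

Notice that `snap` never looks at the neighboring beads. That is exactly the independence assumption that holds at the abstract level.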
DNA works this way as well: Genes are built by
combining a finite (and very small) number of
discrete, snap-together parts (adenine, guanine,
cytosine, & thymine) in an infinite variety of
The letters that are used in any alphabetic writing
system comprise a discrete combinatorial system
– you get infinite variety by combining 26 discrete
elements – a, b, c, d … z.
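To get a feel for how quickly a small set of discrete elements multiplies, here is a rough count of the possible strings of a given length. The helper `count_strings` is hypothetical, and the inventory of 40 phonemes is an approximate figure for English. The count also ignores the fact that real languages do not allow every possible sequence.

```python
def count_strings(inventory_size, length):
    """Number of distinct sequences of `length` elements drawn from the inventory."""
    return inventory_size ** length

inventories = [("DNA bases", 4), ("letters", 26), ("phonemes (approx.)", 40)]

for name, size in inventories:
    counts = [count_strings(size, n) for n in (1, 2, 3, 4)]
    print(f"{name:20s} {counts}")
```

With no upper limit on length, the number of possible sequences has no upper limit either.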
Are there aspects of language other than
morphemes & words built by combining
phonemes that comprise a discrete combinatorial
system?
(Answer: Yes. Language is a discrete combinatorial system everywhere you look. Words
are constructed by using morphological rules to combine discrete parts called morphemes.
Phrases are constructed by using syntactic and semantic rules to combine words – nouns,
verbs, articles, prepositions, etc. Sentences are constructed by using syntactic rules to combine
phrases, the simplest being S NP + VP; i.e., a sentence is composed of a noun phrase and a
verb phrase. So, language is a discrete combinatorial system at all levels.)
The snap-together parts in a discrete combinatorial
system are: (1) discrete, (2) independent.
(1) discrete = digital rather than analog (continuous);
e.g., in genetics, the bases must be one of four
discrete choices: A, G, C, or T; no such thing as
guanine and 3/4th; adenine and a 1/2. In phonetics,
at an abstract level, snap-together phonemes are
one of ~40 discrete choices; e.g., /b/ or /p/, not /b/
and 2/3rd.
(2) independent = the snap-together parts do not
affect one another; e.g., in genetics, guanine is
guanine whether it is attached to adenine,
thymine, cytosine, or another guanine. In
phonemics (not phonetics), /p/ is /p/, whether the
next phoneme is /ɑ/ or /i/ or /æ/ or /r/ or /l/ or
OK, now the big deal about accommodation,
coarticulation, and assimilation.
Here’s the idea we started with: At an abstract
linguistic level, phonetic segments are discrete,
independent, interchangeable snap-together
parts.
The key phrase here is at an abstract linguistic
level. At an abstract linguistic level phonemes are
discrete, independent, and interchangeable
snap-together parts.
However, at the actual level of speech
production – real movements of real
articulators generating real speech
sounds – phonetic segments are:
• not discrete
• not independent
• not interchangeable
[Figure: spectra measured during the [s].]
In a world of discrete, independent, interchangeable parts, the
two instances of [s] would be identical. Are they? The phonetic
environment in which a sound occurs can have a strong
influence on the way a sound is produced. This is an example of
coarticulation – the lip rounding from the [u] carries over to the
[s]; i.e., the rounded lips for [u] just stay rounded for [s].
[s] from [usu] -> [isi]
[s] from [isi] -> [usu]
This shows what happens when the [s] of [isi] is
cut and pasted to [usu], and vice versa. Do the
cross-spliced syllables sound ok?
What does this tell us about the interchangeability of the phonetic elements that are being
recombined in this discrete combinatorial
system? [Answer: They aren’t]
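The cross-splicing demonstration can be mimicked with a toy model. This is a sketch under stated assumptions, not a model of real acoustics: the only context effect is lip rounding from [u], the rounded [s] is written [sʷ] using the IPA labialization diacritic, and the function names are invented for illustration.

```python
ROUNDED_VOWELS = {"u"}  # assumption: [u] is the only rounded vowel modeled

def realize_s(vowel):
    """Return the [s] token produced next to `vowel` (rounded if the vowel is)."""
    return "sʷ" if vowel in ROUNDED_VOWELS else "s"

def syllable(vowel, s_token=None):
    """Build vowel-[s]-vowel; `s_token` lets us paste in an [s] from elsewhere."""
    if s_token is None:
        s_token = realize_s(vowel)
    return [vowel, s_token, vowel]

def fits_context(syll):
    """True if the [s] token is the one this vowel context would have produced."""
    vowel, s_token, _ = syll
    return s_token == realize_s(vowel)

natural_isi = syllable("i")
natural_usu = syllable("u")
spliced_usu = syllable("u", s_token=natural_isi[1])  # [s] from [isi] -> [usu]
spliced_isi = syllable("i", s_token=natural_usu[1])  # [s] from [usu] -> [isi]

for name, syll in [("natural [isi]", natural_isi), ("natural [usu]", natural_usu),
                   ("spliced [usu]", spliced_usu), ("spliced [isi]", spliced_isi)]:
    print(name, syll, "fits context:", fits_context(syll))
```

In this toy model the two natural syllables fit their contexts and the two spliced ones do not. That is the sense in which the [s] tokens are not interchangeable.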
The phenomenon of coarticulation is nothing more
than an articulatory shortcut.
Without coarticulation, producing [usu] would involve:
1) round the lips (and adjust the tongue) for [u]
2) retract the lips (and adjust the tongue) for [s]
3) round the lips again (and adjust the tongue) for [u]
It’s not that this can’t be done. It can. Easily. The
problem is that it slows you down. A lot.
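One way to see what the shortcut saves is to count changes in lip posture across the syllable. The sketch below is a deliberate simplification. It tracks only the lips (not the tongue), it assumes [s] and [i] both have an unrounded target, and the names `LIP_TARGET`, `lip_states`, and `lip_movements` are hypothetical.

```python
# Assumed lip targets for each sound when produced "by the book"
LIP_TARGET = {"u": "rounded", "s": "unrounded", "i": "unrounded"}

def lip_states(phones, coarticulate):
    """Lip posture for each sound, with or without the rounding shortcut."""
    states = [LIP_TARGET[p] for p in phones]
    if coarticulate:
        # Shortcut: [s] simply keeps the lip posture of the preceding sound
        for i in range(1, len(phones)):
            if phones[i] == "s":
                states[i] = states[i - 1]
    return states

def lip_movements(states):
    """Count posture changes after the initial posture is reached."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)

for coarticulate in (False, True):
    states = lip_states(["u", "s", "u"], coarticulate)
    print("coarticulation:", coarticulate, states,
          "lip movements:", lip_movements(states))
```

Without the shortcut, the lips have to unround and then round again. With it, they reach the rounded posture once and stay there.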
Most people speak almost as fast as they can almost
all the time. And we talk fast – ~10 speech sounds per
second and up.
In terms of the # of muscles, the complexity
of the movements, and the precision of the
movements, there is nothing we do that
approaches speech for speed, complexity,
and efficiency.
The primary reason for this: coarticulation.
Coarticulation is not an arcane, egghead
detail. It’s the single most important fact
about speech motor control.
If you understand what’s going on in this one
example – [isi] vs. [usu] – then you understand
most of what you need to know about
Basic Motor-Planning Principle: The motor
planning system will take any shortcut that the
ear and the rules of the language will tolerate.
That’s it. (Memorize that sentence, and be sure
you understand what it means. No kidding.)
[isi] vs. [usu]: If the lip-rounding shortcut
produced a fricative that was so distorted that it
no longer sounded like an [s], the shortcut
wouldn’t be used. But the [s] remains quite
intelligible, so the shortcut is used.
Another example (we’ve already seen this one):
geese vs. gone ([gis] vs. [gɔn])
Is the place of articulation the same? [Hint: No. Place
for [gɔn] is velar, place for [gis] is palatal.]
What’s going on here; i.e., what’s the shortcut?
Why is the place of articulation further forward for
[gis] than [gɔn]? [Answer: You get a more forward place
of articulation for the front vowel than for the back vowel, so the
movements take less time.]
Would this shortcut work for a language that has
both velar and palatal stops? [Answer: Nah]
This is, in part, what was meant earlier by “and the
rules of the language.”
Another example we’ve seen:
bat vs. man ([bæt] vs. [mæn])
What happens to the velum during the [æ] in these
two words?
Phonetically, this is:
[bæt] vs. [mæ̃n]
What’s the shortcut?
Why is the shortcut tolerated? [A: Listeners are ok
with nasalized vowels]
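The bat vs. man contrast can be written as a simple context rule: a vowel next to a nasal consonant is produced with the velum lowered, so it comes out nasalized. The sketch below is a simplification (it treats a nasal on either side the same way, and the vowel set is just a small illustrative sample); the function name `nasalize` is made up.

```python
NASALS = {"m", "n", "ŋ"}
VOWELS = {"æ", "ɪ", "i", "ɑ", "u", "ɔ"}  # small illustrative set, not complete
TILDE = "\u0303"                          # combining tilde, the IPA nasalization mark

def nasalize(phonemes):
    """Return a phonetic string with vowels nasalized next to a nasal consonant."""
    phones = []
    for i, p in enumerate(phonemes):
        neighbors = phonemes[max(i - 1, 0):i] + phonemes[i + 1:i + 2]
        if p in VOWELS and any(n in NASALS for n in neighbors):
            phones.append(p + TILDE)   # velum stays lowered through the vowel
        else:
            phones.append(p)
    return "".join(phones)

print(nasalize(["b", "æ", "t"]))  # no nasal neighbors, vowel stays oral
print(nasalize(["m", "æ", "n"]))  # vowel is flanked by nasals, so it is nasalized
```

The underlying vowel is /æ/ in both words. Only the phonetic output differs.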
There are times when the shortcut is not tolerated by the ear
directly, but it is still allowed by listeners. These
are examples of assimilation. Transcribe these:
this steak vs. this shoe
What’s the shortcut? Do you get an acceptable [s] in the
word “this” in both cases?
Produce this sentence as naturally as you can:
I get my news by listening to NPR.
How about this:
I called her from a phone booth.
And this, naturally and at a fairly rapid speaking rate:
did he vs. did you
Accommodation: MacKay’s word for an articulatory
shortcut. Two Varieties of Accommodation:
Coarticulation: The sound that’s modified by the
shortcut is directly tolerated by the ear; e.g., the [s]
in [usu] still sounds like an [s]; the [g] in [gis] still
sounds like a [g], etc.
Assimilation: The sound that’s modified by the
shortcut is not directly tolerated by the ear; e.g., the
underlying /s/ of “this” in “this shoe” no longer
sounds like an /s/; the modified sound – now an [ʃ] –
is tolerated by the language system.
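Assimilation can be sketched as a lookup of "this sound before that sound" pairs. The two rules below cover only the "this shoe" and "phone booth" examples from these slides; the table `ASSIMILATION_RULES` and the function `assimilate` are hypothetical names, and real assimilation is more general than a two-entry table.

```python
# (underlying sound, following sound) -> sound actually produced
ASSIMILATION_RULES = {
    ("s", "ʃ"): "ʃ",  # this shoe   -> "thish shoe"
    ("n", "b"): "m",  # phone booth -> "foam booth"
}

def assimilate(phonemes):
    """Apply the toy assimilation rules to a list of underlying phonemes."""
    phones = list(phonemes)
    for i in range(len(phonemes) - 1):
        pair = (phonemes[i], phonemes[i + 1])
        if pair in ASSIMILATION_RULES:
            phones[i] = ASSIMILATION_RULES[pair]
    return phones

examples = {
    "this steak": list("ðɪsstek"),
    "this shoe": list("ðɪsʃu"),
    "phone booth": list("fonbuθ"),
}

for phrase, phonemes in examples.items():
    phones = assimilate(phonemes)
    changed = [(a, b) for a, b in zip(phonemes, phones) if a != b]
    print(phrase, "".join(phones), "changed:", changed)
```

Unlike the coarticulation cases, the changed segment here lands in a different sound category: /s/ comes out as [ʃ], and /n/ comes out as [m]. The sketch keeps both [ʃ] segments in "this shoe"; in fast speech they typically merge into one.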
Last point:
The impression you may have is that the
motor planning system cares only about
speed. This can’t be. Imagine a speaker who
says only this:
“buh-buh-buh” – very fast, but you can't
distinguish one word from another.
Mature speakers do not pronounce cupcake
as cupake [kʌpek], though that would be
faster; or cukake [kʌkek] (toddlers take that kind of
shortcut all the time – but NOT for reasons of speed),
though that too would be faster. Why?
The motor planning system reflects some poorly
understood compromise among three factors:
(1) speed
(2) intelligibility
(3) what listeners will accept as natural
The last one is the trickiest. For reasons
that are not too clear, listeners are ok with
thish shoe [ðɪʃu] and foam booth [fom buθ],
but they’re not ok with cukake [kʌkek] or
cupake [kʌpek].
Details aside, here’s the key idea:
The basic motor-planning principle
underlying accommodation (coarticulation
& assimilation) is this:
The motor planning system will take any
shortcut that listeners will tolerate (i.e., the
speech needs to be intelligible and it needs to sound
natural).
Q: Why do we take these shortcuts?
A: Simple: They allow us to speak faster.