Speech Sounds of American English

Lecture # 3-4
Session 2003
Speech Sounds of American English
• There are over 40 speech sounds in American English which can
be organized by their basic manner of production
Manner Class   Number
Vowels         18
Fricatives     8
Stops          6
Nasals         3
Semivowels     4
Affricates     2
Aspirant       1
• Vowels, glides, and consonants differ in degree of constriction
• Sonorant consonants have no pressure build up at constriction
• Nasal consonants lower the velum allowing airflow in nasal cavity
• Continuant consonants do not block airflow in oral cavity
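As a quick arithmetic check on the inventory above, the per-class counts can be tallied in a few lines. This is only a sketch; the dictionary keys and function names are illustrative choices, not part of the lecture.

```python
# Manner-class counts as listed on the slide.
MANNER_CLASS_COUNTS = {
    "vowels": 18,
    "fricatives": 8,
    "stops": 6,
    "nasals": 3,
    "semivowels": 4,
    "affricates": 2,
    "aspirant": 1,
}


def total_sounds(counts):
    """Return the total number of speech sounds across all manner classes."""
    return sum(counts.values())


def non_vowel_sounds(counts):
    """Return the number of sounds in every class except the vowels."""
    return total_sounds(counts) - counts["vowels"]


print(total_sounds(MANNER_CLASS_COUNTS))      # 42, consistent with "over 40"
print(non_vowel_sounds(MANNER_CLASS_COUNTS))  # 24
```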
6.345 Automatic Speech Recognition
Speech Sounds 1
Vowel Production
• No significant constriction in the vocal tract
• Usually produced with periodic excitation
• Acoustic characteristics depend on the position of the jaw,
tongue, and lips
[Figure panels: [i], [æ], [a], [u]]
Vowels of American English
• There are approximately 18 vowels in American English made up
of monophthongs, diphthongs, and reduced vowels (schwas)
Phoneme   ARPAbet   Example
/iʸ/      iy        beat
/ɪ/       ih        bit
/eʸ/      ey        bait
/ɛ/       eh        bet
/æ/       ae        bat
/a/       aa        Bob
/ɔ/       ao        bought
/ʌ/       ah        but
/oʷ/      ow        boat
/ʊ/       uh        book
/u/       uw        boot
/ɝ/       er        Bert
/aʸ/      ay        bite
/ɔʸ/      oy        Boyd
/aʷ/      aw        bout
[ə]       ax        about
[ɨ]       ix        roses
[ɚ]       axr       butter
• They are often described by the articulatory features: High/Low,
Front/Back, Retroflexed, Rounded, and Tense/Lax
Spectrograms of the Cardinal Vowels
[Figure: spectrogram displays for "beet" /biʸt/, "bat" /bæt/, "bott" /bat/, and "boot" /but/. Each panel shows zero crossing rate, total energy (dB), energy from 125 Hz to 750 Hz (dB), a wide-band spectrogram (0 to 8 kHz), and the waveform, all against time in seconds.]
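The spectrogram displays in this lecture include three per-frame measurements: zero crossing rate, total energy, and energy from 125 Hz to 750 Hz. The slides do not say exactly how these were computed, so the following is only a minimal sketch of one common way to define them. The 16 kHz sample rate, the 320-sample frame, the dB floor, and the synthetic test tones are all assumptions made for illustration.

```python
import cmath
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def total_energy_db(frame, floor=1e-12):
    """Mean-square energy of the frame in dB (floored to avoid log(0))."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, floor))


def band_energy_db(frame, sample_rate, low_hz=125.0, high_hz=750.0,
                   floor=1e-12):
    """Energy in dB of the DFT bins between low_hz and high_hz.

    Uses a naive O(N^2) DFT so that the sketch needs no third-party library.
    """
    n = len(frame)
    energy = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if low_hz <= freq <= high_hz:
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)
            )
            energy += abs(coeff) ** 2
    return 10.0 * math.log10(max(energy / (n * n), floor))


def tone(freq_hz, sample_rate, n_samples, phase=0.1):
    """Synthetic unit-amplitude sine, used only to exercise the functions."""
    return [
        math.sin(2 * math.pi * freq_hz * t / sample_rate + phase)
        for t in range(n_samples)
    ]


SAMPLE_RATE = 16000  # assumed sample rate, not stated in the slides
low_tone = tone(500.0, SAMPLE_RATE, 320)    # inside the 125-750 Hz band
high_tone = tone(4000.0, SAMPLE_RATE, 320)  # outside the band

for name, frame in (("500 Hz", low_tone), ("4000 Hz", high_tone)):
    print(
        name,
        round(zero_crossing_rate(frame), 3),
        round(total_energy_db(frame), 1),
        round(band_energy_db(frame, SAMPLE_RATE), 1),
    )
```

On these synthetic tones the two frames have the same total energy, but the higher tone has a much higher zero crossing rate and almost no energy in the low band. This is the kind of contrast the three tracks are meant to expose, for example between a voiced vowel and an unvoiced fricative.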
Vowel Formant Averages
• Vowels are often characterized by the lower three formants
• High/Low is correlated with the first formant, F1
• Front/Back is correlated with the second formant, F2
• Retroflexion is marked by a low third formant, F3
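The formant correlations above can be illustrated with a small comparison sketch. The formant values below are approximate male averages often quoted from Peterson and Barney (1952); they are not read from this lecture's plots, and the function names are illustrative. The sketch assumes the usual direction of the correlations: a higher vowel has a lower F1, and a fronter vowel has a higher F2.

```python
from typing import NamedTuple


class Formants(NamedTuple):
    f1: float  # first formant, Hz
    f2: float  # second formant, Hz
    f3: float  # third formant, Hz


# Approximate textbook averages (assumption: not taken from these slides).
EXAMPLE_FORMANTS = {
    "iy": Formants(270, 2290, 3010),
    "aa": Formants(730, 1090, 2440),
    "uw": Formants(300, 870, 2240),
    "er": Formants(490, 1350, 1690),
}


def is_higher(a, b):
    """True if vowel a is higher than vowel b (lower F1 means higher)."""
    return a.f1 < b.f1


def is_fronter(a, b):
    """True if vowel a is fronter than vowel b (higher F2 means fronter)."""
    return a.f2 > b.f2


def most_retroflexed(table):
    """Return the label with the lowest F3, the cue for retroflexion."""
    return min(table, key=lambda label: table[label].f3)


print(is_higher(EXAMPLE_FORMANTS["iy"], EXAMPLE_FORMANTS["aa"]))   # True
print(is_fronter(EXAMPLE_FORMANTS["iy"], EXAMPLE_FORMANTS["uw"]))  # True
print(most_retroflexed(EXAMPLE_FORMANTS))                          # er
```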
[Figure: two plots, Female Speakers and Male Speakers, of average frequency (Hz, 0 to 3500) of F1, F2, and F3 for each vowel: iʸ ɪ eʸ ɛ æ a ɔ ʌ oʷ ʊ u ɝ ə ɨ]
Vowel Durations
• Each vowel has a different intrinsic duration
• Schwas have distinctly shorter durations (50 ms)
• /ɪ, ɛ, ʌ, ʊ/ are the shortest monophthongs
• Context can greatly influence vowel duration
[Figure: two plots, Male Speakers and Female Speakers, of average duration (ms, 0 to 250) for each vowel, including the diphthongs and schwas]
Rob's Happy Little Vowel Chart
"So inaccurate, yet so useful."
[Figure: vowel chart laid out from FRONT to BACK and from HIGH through MID to LOW, with the vowel symbols placed by position and axis annotations "F2 Increases" and "F1 Increases"]
• TENSE = towards edges; tends to be longer
• LAX = towards center; tends to be shorter
SCHWAS:
Plain [ə] About [əbaʷt]
Front [ɨ] Roses [roʷzɨz]
Retroflex [ɚ] Forever [fɚɛvɚ]
Fricative Production
• Turbulence produced at narrow constriction
• Constriction position determines acoustic characteristics
• Can be produced with periodic excitation
[Figure panels: [f], [θ], [s], [ʃ]]
Fricatives of American English
• There are 8 fricatives in American English
• Four places of articulation: Labio-Dental (Labial), Interdental
(Dental), Alveolar, and Palato-Alveolar (Palatal)
• They are often described by the features Voiced/Unvoiced, or
Strident/Non-Strident (constriction behind alveolar ridge)
Type       Unvoiced           Voiced
Labial     /f/  f   fee       /v/  v   v
Dental     /θ/  th  thief     /ð/  dh  thee
Alveolar   /s/  s   see       /z/  z   z
Palatal    /ʃ/  sh  she       /ʒ/  zh  Gigi
Spectrograms of Unvoiced Fricatives
[Figure: spectrogram displays for "fee" /fiʸ/, "thief" /θiʸf/, "see" /siʸ/, and "she" /ʃiʸ/. Each panel shows zero crossing rate, total energy (dB), energy from 125 Hz to 750 Hz (dB), a wide-band spectrogram (0 to 8 kHz), and the waveform, all against time in seconds.]
Fricative Energy
[Figure: probability density (unadjusted for frequency, 0 to 0.06) of average total energy (axis from -100 to -40) for NON-STRIDENT and STRIDENT fricatives]
Strident fricatives tend to be stronger than non-strident fricatives.
Fricative Durations
[Figure: probability density (unadjusted for frequency, 0 to 14) of duration (axis from 0.0 to 0.30) for UNVOICED and VOICED fricatives]
Voiced fricatives tend to be shorter than unvoiced fricatives.
Examples of Fricative Voicing Contrast
[Figure: spectrogram displays for "sue" /su/, "zoo" /zu/, "face" /feʸs/, and "phase" /feʸz/. Each panel shows zero crossing rate, total energy (dB), energy from 125 Hz to 750 Hz (dB), a wide-band spectrogram (0 to 8 kHz), and the waveform, all against time in seconds.]
Rob's Friendly Little Consonant Chart
"Somewhat more accurate, yet somewhat less useful."

Manner of        Place of Articulation
Articulation     Labial   Dental   Alveolar   Palatal   Velar
Stop             p b               t d                  k g
Fricative        f v      θ ð      s z        ʃ ʒ
Nasal            m                 n                    ŋ

Voicing: in each pair the unvoiced sound is listed first, the voiced sound second.
Fricatives: f v and θ ð are weak (non-strident); s z and ʃ ʒ are strong (strident).

The Semi-vowels:
y is like an extreme i
w is like an extreme u
l is like an extreme o
r is like an extreme ɝ

The Odds and Ends:
h (unvoiced h)
ɦ (voiced h)
ɾ (flap)
ɾ̃ (nasalized flap)
ʔ (glottal stop)

The Affricates:
tʃ is like t+ʃ
dʒ is like d+ʒ
What is this word?
[Figure: spectrogram display of an unidentified word, about 1.1 seconds long, showing zero crossing rate, total energy (dB), energy from 125 Hz to 750 Hz (dB), a wide-band spectrogram (0 to 8 kHz), and the waveform]
Stop Production
• Complete closure in the vocal tract, pressure build up
• Sudden release of the constriction, turbulence noise
• Can have periodic excitation during closure
[Figure panels: [b], [d], [g]]
Stops of American English
• There are 6 stop consonants in American English
• Three places of articulation: Labial, Alveolar, and Velar
• Each place of articulation has a voiced and unvoiced stop
Type       Voiced            Unvoiced
Labial     /b/  b  bought    /p/  p  pot
Alveolar   /d/  d  dot       /t/  t  tot
Velar      /g/  g  got       /k/  k  cot
• Unvoiced stops are typically aspirated
• Voiced stops usually exhibit a “voice bar” during closure
• Information about formant transitions and release is useful for
classification
Spectrograms of Unvoiced Stops
[Figure: spectrogram displays for "poop" /pup/, "toot" /tut/, and "kook" /kuk/. Each panel shows zero crossing rate, total energy (dB), energy from 125 Hz to 750 Hz (dB), a wide-band spectrogram (0 to 8 kHz), and the waveform, all against time in seconds.]
Examples of Stop Voicing Contrast
[Figure: spectrogram displays for "pop" /pap/ and "bob" /bab/. Each panel shows zero crossing rate, total energy (dB), energy from 125 Hz to 750 Hz (dB), a wide-band spectrogram (0 to 8 kHz), and the waveform, all against time in seconds.]
Singleton Stop Durations
[Figure: VOT duration (ms, 0 to 80) for the singleton stops b, d, g, p, t, k]
Voice onset times (VOTs) are longer for unvoiced stops.
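The VOT contrast can be turned into a very rough voicing guess. The sketch below is an illustration only: the 30 ms boundary is a hypothetical value chosen for the example and is not read from the plot, and the function names are invented here. A single threshold is also too crude in practice, since there are many voicing cues for a stop and unvoiced stops lose their aspiration after /s/.

```python
# Hypothetical decision boundary in milliseconds (assumption, not from the plot).
VOT_THRESHOLD_MS = 30.0

# Voiced and unvoiced stop at each place of articulation, as in the stop table.
STOP_PAIRS = {
    "labial": ("b", "p"),
    "alveolar": ("d", "t"),
    "velar": ("g", "k"),
}


def guess_stop_voicing(vot_ms, threshold_ms=VOT_THRESHOLD_MS):
    """Guess voicing from voice onset time alone: long VOT means unvoiced."""
    return "unvoiced" if vot_ms >= threshold_ms else "voiced"


def label_stop(place, vot_ms, threshold_ms=VOT_THRESHOLD_MS):
    """Pick the stop at the given place whose voicing matches the VOT guess."""
    voiced, unvoiced = STOP_PAIRS[place]
    if guess_stop_voicing(vot_ms, threshold_ms) == "unvoiced":
        return unvoiced
    return voiced


print(label_stop("labial", 10.0))  # b
print(label_stop("velar", 70.0))   # k
```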
Voicing Cues for Stops
There are many voicing cues for a stop.
/s/-Stop Durations
[Figure: VOT duration (ms, 0 to 80) for p, t, k in /s/-stop sequences]
Unvoiced stops are unaspirated in /s/-stop sequences.
Examples of Front and Back Velars
[Figure: spectrogram displays for "keep" /kiʸp/ and "cot" /kɔt/. Each panel shows zero crossing rate, total energy (dB), energy from 125 Hz to 750 Hz (dB), a wide-band spectrogram (0 to 8 kHz), and the waveform, all against time in seconds.]
What is this word?
[Figure: spectrogram display of an unidentified word, about 0.8 seconds long, showing zero crossing rate, total energy (dB), energy from 125 Hz to 750 Hz (dB), a wide-band spectrogram (0 to 8 kHz), and the waveform]
Nasal Production
• Velum lowering results in airflow through nasal cavity
• Consonants produced with closure in oral cavity
• Nasal murmurs have similar spectral characteristics
[Figure panels: [m], [n], [ŋ]]
Nasals of American English
• Three places of articulation: Labial, Alveolar, and Velar
Type       Nasal
Labial     /m/  m   me
Alveolar   /n/  n   knee
Velar      /ŋ/  ng  sing
• Nasal consonants are always attached to a vowel, though they can
form an entire syllable in unstressed environments ([n̩], [m̩], [ŋ̍])
• /ŋ/ is always post-vocalic in English
• Place identified by neighboring formant transitions
Spectrograms of Nasals
[Figure: spectrogram displays for "simmer" /sɪmɝ/, "sinner" /sɪnɝ/, and "singer" /sɪŋɝ/. Each panel shows zero crossing rate, total energy (dB), energy from 125 Hz to 750 Hz (dB), a wide-band spectrogram (0 to 8 kHz), and the waveform, all against time in seconds.]
What is this word?
[Figure: spectrogram display of an unidentified word, about 0.8 seconds long, showing zero crossing rate, total energy (dB), energy from 125 Hz to 750 Hz (dB), a wide-band spectrogram (0 to 8 kHz), and the waveform]
Semivowel Production
• Constriction in vocal tract, no turbulence
• Slower articulatory motion than other consonants
• Laterals form complete closure with tongue tip,
airflow via sides of constriction
[Figure panels: [w], [y], [r], [l]]
Semivowels of American English
• There are 4 semivowels in American English
• Sometimes referred to as Liquids or Glides
Type      Semivowel       Nearest Vowel
Glides    /w/  w  wet     /u/
          /y/  y  yet     /i/
Liquids   /r/  r  red     /ɝ/
          /l/  l  let     /o/
• Glides are a more extreme articulation of a corresponding vowel
– Similar, though more extreme, formant positions
– Generally weaker due to narrower constriction
• Semivowels are always attached to a vowel, though /l/ can form
an entire syllable in unstressed environments ([l̩])
Spectrograms of Semivowels
[Figure: spectrogram displays for "we" /wiʸ/, "ye" /yiʸ/, "reed" /riʸd/, and "lee" /liʸ/. Each panel shows zero crossing rate, total energy (dB), energy from 125 Hz to 750 Hz (dB), a wide-band spectrogram (0 to 8 kHz), and the waveform, all against time in seconds.]
Acoustic Properties of Semivowels
• /w/ and /l/ are the most confusable semivowels
• /w/ is characterized by a very low F1 and F2
– Typically a rapid spectral falloff above F2
• /l/ is characterized by a low F1 and F2
– Often presence of high frequency energy
– Postvocalic /l/ characterized by minimal spectral discontinuity,
gradual motion of formants
• /y/ is characterized by very low F1, very high F2
– /y/ only occurs in a syllable onset position (i.e., pre-vocalic)
• /r/ is characterized by a very low F3
– Prevocalic F3 < medial F3 < postvocalic F3
What is this word?
[Figure: spectrogram display of an unidentified word, about 1.3 seconds long, showing zero crossing rate, total energy (dB), energy from 125 Hz to 750 Hz (dB), a wide-band spectrogram (0 to 8 kHz), and the waveform]
Affricate Production
• There are two affricates in American English:
Voiced     /dʒ/  jh  judge
Unvoiced   /tʃ/  ch  church
• Alveolar-stop palatal-fricative pairs
• Sudden release of the constriction, turbulence noise
• Can have periodic excitation during closure
Aspirant Production
• There is one aspirant in American English: /h/ (e.g., “hat”)
• Produced by generating turbulence excitation at glottis
• No constriction in the vocal tract, normal formant excitation
• Sub-glottal coupling results in little energy in F1 region
• Periodic excitation can be present in medial position
Spectrograms of Affricates and Aspirant
[Figure: spectrogram displays for "each" /iʸtʃ/ and "huge" /hyudʒ/. Each panel shows zero crossing rate, total energy (dB), energy from 125 Hz to 750 Hz (dB), a wide-band spectrogram (0 to 8 kHz), and the waveform, all against time in seconds.]
What is this word?
[Figure: spectrogram display of an unidentified word, about 0.8 seconds long, showing zero crossing rate, total energy (dB), energy from 125 Hz to 750 Hz (dB), a wide-band spectrogram (0 to 8 kHz), and the waveform]
Phonotactic Constraints
• Phonotactics is the study of allowable sound sequences
• Analyses of word-initial and -final clusters reveal:
– 73 distinct initial clusters (about 10 “foreign” clusters)
– 208 distinct final clusters
• Can be used to eliminate impossible phoneme sequences:
– /tk/ can’t end a word, and
– /kt/ can’t begin a word,
– Therefore, */...tkt.../ is an impossible sequence
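The elimination argument above can be sketched as a search over word-boundary positions inside a consonant run. This is a minimal illustration under stated assumptions: each character stands for one phoneme, the run contains at most one word boundary, the initial clusters are a small subset of the word-initial inventory in this lecture, and the final clusters are a hypothetical sample picked for the example (the slides give only their count).

```python
# Small subset of the word-initial clusters listed in the lecture.
INITIAL_CLUSTERS = {"s", "t", "k", "tr", "kl", "kr", "st", "str", "sk"}

# Hypothetical sample of word-final clusters (assumption: not from the slides).
FINAL_CLUSTERS = {"t", "k", "kt", "ts", "ks", "st"}


def possible_boundaries(consonants, initial=INITIAL_CLUSTERS,
                        final=FINAL_CLUSTERS):
    """Return every (word-final, word-initial) split that is allowed.

    An empty half means the whole run sits at the end or the start of a
    word. An empty result means the run is phonotactically impossible
    under the given cluster inventories.
    """
    splits = []
    for i in range(len(consonants) + 1):
        coda, onset = consonants[:i], consonants[i:]
        coda_ok = coda == "" or coda in final
        onset_ok = onset == "" or onset in initial
        if coda_ok and onset_ok:
            splits.append((coda, onset))
    return splits


print(possible_boundaries("tkt"))  # [] : no valid split, so impossible
print(possible_boundaries("kt"))   # [('k', 't'), ('kt', '')]
```

A recognizer could use a check of this kind to prune phoneme hypotheses before lexical search, though a real system would need the full cluster inventories and would also have to handle word-internal syllable boundaries.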
Word-Initial Consonants from MWP Dictionary
Cluster  Example    Cluster  Example    Cluster  Example      Cluster  Example
(none)   of         hy       human      sf       sphere       tr       true
b        be         dʒ       just       sk       school       ts       tsunami
bl       black      k        can        skl      sclerosis    tw       twenty
br       bring      kl       class      skr      screen       ty       Tuesday
by       beauty     kr       cross      skw      square       θ        thief
tʃ       child      kw       quite      sky      skewer       θr       through
d        do         ky       curious    sl       slow         θw       thwart
dr       drive      l        like       sm       small        ð        the
dw       dwell      m        more       sn       snake        v        very
f        for        mw       moire      sp       special      vw       voyager
fl       floor      my       music      spl      split        vy       view
fr       from       n        not        spr      spring       w        was
fy       few        p        people     spy      spurious     y        you
g        good       pl       place      st       state        z        zero
gl       glass      pr       price      str      street       zl       zloty
gr       great      pw       pueblo     sw       sweet        zw       zwieback
gw       guava      py       pure       ʃ        she          ʒ        genre
h        he         r        right      ʃr       shrewd
hw       which      s        so         t        to
The Syllable
• Syllable structure captures many useful generalizations
– Phoneme realization often depends on syllabification
– Many phonological rules depend on syllable structure
• Syllable structure is predicated on the notion of ranking the
speech sounds in terms of their sonority values
Sounds                Sonority Values   Examples
Low Vowels            10                /a, ɔ/
Mid Vowels            9                 /e, o/
High Vowels           8                 /i, u/
Flaps                 7                 /r/
Laterals              6                 /l/
Nasals                5                 /m, n, ŋ/
Voiced Fricatives     4                 /v, ð, z/
Unvoiced Fricatives   3                 /f, θ, s/
Voiced Stops          2                 /b, d, g/
Unvoiced Stops        1                 /p, t, k/
Syllables and Sonority
• Utterances can be divided into syllables
• The number of syllables equals the number of sonority peaks
• Within any syllable, there is a segment constituting a sonority
peak that is preceded and/or followed by a sequence of segments
with progressively decreasing sonority values
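The peak-counting rule above can be sketched directly. The sonority sequences below are the values given for the slide's three example words, as best they can be read from the slide; the function name and the way ties are handled (a flat run of equal values counts as one peak) are choices made for this sketch.

```python
def count_sonority_peaks(values):
    """Count local maxima in a sequence of sonority values.

    A position is a peak if the value rises into it (or it is the first
    position) and does not rise after it (or it is the last position).
    A plateau of equal values is counted once, at its first position.
    """
    peaks = 0
    for i, cur in enumerate(values):
        rises_in = i == 0 or values[i - 1] < cur
        no_rise_after = i == len(values) - 1 or values[i + 1] <= cur
        if rises_in and no_rise_after:
            peaks += 1
    return peaks


# Sonority values per segment for the slide's example words.
EXAMPLES = {
    "suprasegmental": [3, 8, 1, 7, 9, 3, 9, 2, 5, 9, 5, 1, 9, 6],
    "minimization": [5, 8, 5, 8, 5, 10, 4, 9, 3, 9, 5],
    "fire": [3, 10, 8, 9],
}

for word, sonorities in EXAMPLES.items():
    print(word, count_sonority_peaks(sonorities))
# suprasegmental 5
# minimization 5
# fire 2
```

Each count matches the number of syllables in the word, with "fire" treated as two syllables because the final retroflex vowel forms its own peak after the glide.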
Examples (sonority value under each segment):

suprasegmental
s  u  p  r  ʌ  s  ɛ  g  m  ɛ  n  t  ə  l
3  8  1  7  9  3  9  2  5  9  5  1  9  6

minimization
m  ɪ  n  ɪ  m  aʸ  z  e  ʃ  ə  n
5  8  5  8  5  10  4  9  3  9  5

fire
f  aʸ       ɚ
3  10 (8)   9
The Syllable Template
[Figure: syllable template tree with nodes Onset, Nucleus, Coda, Affix, and Rhyme]
• Branches marked by ◦ are optional
• Nucleus must contain a non-obstruent
• Sonority decreases away from nucleus
• Affix contains only coronals: /s, z, t, d, θ, ð, tʃ, dʒ/
• Only the last syllable in a word can have an affix
• /sp/, /st/, and /sk/ are treated as single obstruents
Some Examples
Word      Outer   Inner   Nucleus   Inner   Outer   Affix   Affix   Affix
          Onset   Onset             Coda    Coda    1       2       3
crown     k       r       a         w       n
fledged   f       l       ɛ                 dʒ      d
links     l               ɪ         ŋ       k       s
dwarves   d       w       a         r       v       z
stick     st              ɪ                 k
sixths    s               ɪ                 k       s       θ       s
Words Containing /r/ and /l/
Word     Outer   Inner   Nucleus   Inner   Outer
         Onset   Onset             Coda    Coda
rock     r               a                 k
crock    k       r       a                 k
curt     k               ɝ                 t
cart     k               a         r       t
car      k               a                 r
lick     l               ɪ                 k
bottle   b               a, l̩              t
kill     k               ɪ                 l
(The Affix 1, Affix 2, and Affix 3 columns are empty for these words.)
Acoustic Realizations of /r/
[Figure panels: "rock" /rak/, "curt" /kɝt/, "car" /kar/]
Acoustic Realizations of /l/
[Figure panels: "lick" /lɪk/, "bottle" /batl̩/, "kill" /kɪl/]
Allophonic Variations at Syllable Boundaries
[Figure panels: "nitrate" /naʸtreʸt/, "night rate" /naʸt reʸt/]
Assignment 2