Lecture # 3-4 Session 2003 Speech Sounds of American English • There are over 40 speech sounds in American English which can be organized by their basic manner of production Manner Class Vowels Fricatives Stops Nasals Semivowels Affricates Aspirant Number 18 8 6 3 4 2 1 • Vowels, glides, and consonants differ in degree of constriction • Sonorant consonants have no pressure build up at constriction • Nasal consonants lower the velum allowing airflow in nasal cavity • Continuant consonants do not block airflow in oral cavity 6.345 Automatic Speech Recognition Speech Sounds 1 Vowel Production • No significant constriction in the vocal tract • Usually produced with periodic excitation • Acoustic characteristics depend on the position of the jaw, tongue, and lips [i] 6.345 Automatic Speech Recognition [@] [a] [u] Speech Sounds 2 Vowels of American English • There are approximately 18 vowels in American English made up of monothongs, diphthongs, and reduced vowels (schwa’s) /i¤ / /I/ /e¤ / /E/ /@/ /a/ iy ih ey eh ae aa beat bit bait bet bat Bob /O/ /^/ /o⁄ / /U/ /u/ /5/ ao ah ow uh uw er bought but boat book boot Bert /a¤ / /O¤ / /a⁄ / [{] [|] [}] ay oy aw ax ix axr bite Boyd bout about roses butter • They are often described by the articulatory features: High/Low, Front/Back, Retroflexed, Rounded, and Tense/Lax 6.345 Automatic Speech Recognition Speech Sounds 3 Spectrograms of the Cardinal Vowels 16 Time (seconds) 0.3 0.4 0.0 0.1 0.2 Zero Crossing Rate 0.5 0.6 0.7 kHz 8 0 16 16 0.5 0.6 0.7 16 16 0.7 16 16 0 0 0 0 0 dB dB dB dB dB dB dB 8 8 8 7 7 7 7 6 6 6 5 5 5 Wide Band Spectrogram kHz 4 4 kHz 0.5 0.6 0.7 16 8 kHz 0 Total Energy dB dB dB dB Energy -- 125 Hz to 750 Hz 8 Time (seconds) 0.3 0.4 0.0 0.1 0.2 Zero Crossing Rate 0 Total Energy dB Energy -- 125 Hz to 750 Hz kHz 4 0.6 8 kHz kHz 8 Energy -- 125 Hz to 750 Hz Wide Band Spectrogram 0.5 8 kHz kHz 8 Total Energy dB Time (seconds) 0.3 0.4 0.0 0.1 0.2 Zero Crossing Rate 8 kHz kHz 8 Total Energy dB 8 Time (seconds) 0.3 0.4 0.0 0.1 0.2 Zero Crossing Rate dB Energy -- 125 Hz to 750 Hz dB 8 8 7 7 7 7 6 6 6 6 6 5 5 5 5 5 Wide Band Spectrogram kHz 4 4 kHz 8 Wide Band Spectrogram kHz 4 4 kHz 4 kHz 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 Waveform 0.0 0.1 Waveform 0.2 0.3 0.4 0.5 0.6 0.7 0.0 beet /bi¤ t/ 6.345 Automatic Speech Recognition 0.1 Waveform 0.2 0.3 0.4 bat /b@t/ 0.5 0.6 0.7 0.0 0.1 0 Waveform 0.2 0.3 0.4 bott /bat/ 0.5 0.6 0.7 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 boot /but/ Speech Sounds 4 Vowel Formant Averages • Vowels are often characterized by the lower three formants • High/Low is correlated with the first formant, F1 • Front/Back is correlated with the second formant, F2 • Retroflexion is marked by a low third formant, F3 Female Speakers Male Speakers 3500 3500 F3 F2 F1 F3 F1 3000 Average Frequency (Hz) Average Frequency (Hz) 3000 F2 2500 2000 1500 1000 2500 2000 1500 1000 500 500 0 0 i¤ I e¤ E @ a O ^ o⁄ U u 5 { Vowel 6.345 Automatic Speech Recognition | i¤ I e¤ E @ a O ^ o⁄ U u 5 { | Vowel Speech Sounds 5 Vowel Durations • Each vowel has a different intrinsic duration • Schwa’s have distinctly shorter durations (50ms) • /I, E, ^, U/ are the shortest monothongs • Context can greatly influence vowel duration Male Speakers 250 250 200 200 Average Duration (ms) Average Duration (ms) Female Speakers 150 100 50 0 150 100 50 0 i¤ I e¤ E @ a O ^ o⁄ U u 5 { | a⁄ o¤ a¤ ¤u i¤ I e¤ E @ a O ^ o⁄ U u 5 { | a⁄ o¤ a¤ ¤u Vowel Vowel 6.345 Automatic Speech Recognition Speech Sounds 6 Rob 's Happy Little Vowel Chart "So inaccurate, yet so useful." FRONT k F3 n i Th i F2 Increases I uÚ e E ow ? o u HIGH Yo u r p a l 5 is go! o t ay the w @ ^,{ U BACK is might yl TENSE = Towards Edges tends to be longer O MID a LOW LAX = Towards Center tends to be shorter SCHWAS: Plain [{] About [{ba⁄t] Front [|] Roses [ro⁄z|z] Retroflex [}] Forever [f}Ev}] F1 Increases 6.345 Automatic Speech Recognition Speech Sounds 7 Fricative Production • Turbulence produced at narrow constriction • Constriction position determines acoustic characteristics • Can be produced with periodic excitation [f] 6.345 Automatic Speech Recognition [T] [s] [S] Speech Sounds 8 Fricatives of American English • There are 8 fricatives in American English • Four places of articulation: Labio-Dental (Labial), Interdental (Dental), Alveolar, and Palato-Alveolar (Palatal) • They are often described by the features Voiced/Unvoiced, or Strident/Non-Strident (constriction behind alveolar ridge) Type Labial Dental Alveolar Palatal 6.345 Automatic Speech Recognition Unvoiced /f/ f fee /T/ th thief /s/ s see /S/ sh she /v/ /D/ /z/ /Z/ Voiced v v dh thee z z zh Gigi Speech Sounds 9 Spectrograms of Unvoiced Fricatives 16 Time (seconds) 0.3 0.4 0.0 0.1 0.2 Zero Crossing Rate 0.5 0.6 0.7 kHz 8 0 16 16 0.5 0.6 0.7 16 16 0.6 0.7 0.8 16 16 0 0 0 0 0 dB dB dB dB dB dB dB 8 8 8 7 7 7 7 6 6 6 5 5 5 Wide Band Spectrogram kHz 4 4 kHz Time (seconds) 0.3 0.4 0.5 0.6 0.7 16 8 kHz 0 Total Energy dB dB dB dB Energy -- 125 Hz to 750 Hz 8 0.0 0.1 0.2 Zero Crossing Rate 0 Total Energy dB Energy -- 125 Hz to 750 Hz kHz 4 Time (seconds) 0.4 0.5 8 kHz kHz 8 Energy -- 125 Hz to 750 Hz Wide Band Spectrogram 0.3 8 kHz kHz 8 Total Energy dB 0.0 0.1 0.2 Zero Crossing Rate 8 kHz kHz 8 Total Energy dB 8 Time (seconds) 0.3 0.4 0.0 0.1 0.2 Zero Crossing Rate dB Energy -- 125 Hz to 750 Hz dB 8 8 7 7 7 7 6 6 6 6 6 5 5 5 5 5 Wide Band Spectrogram kHz 4 4 kHz 8 Wide Band Spectrogram kHz 4 4 kHz 4 kHz 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 Waveform 0.0 0.1 Waveform 0.2 0.3 0.4 0.5 0.6 0.7 0.0 0.1 fee /fi¤ / 6.345 Automatic Speech Recognition Waveform 0.2 0.3 0.4 thief /Ti¤ f/ 0.5 0.6 0.7 0.0 0.1 0 Waveform 0.2 0.3 0.4 0.5 see /si¤ / 0.6 0.7 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.6 she /Si¤ / Speech Sounds 10 0.7 Fricative Energy 0.06 0.04 0.02 0.0 Probability Density unadjusted for frequency NON-STRIDENT STRIDENT -100 -90 -80 -70 -60 -50 -40 Average Total Energy Strident fricatives tend to be stronger than non-strident fricatives. 6.345 Automatic Speech Recognition Speech Sounds 11 Fricative Durations 12 10 8 6 4 2 0 Probability Density unadjusted for frequency 14 UNVOICED VOICED 0.0 0.05 0.10 0.15 0.20 0.25 0.30 Duration Voiced fricatives tend to be shorter than unvoiced fricatives. 6.345 Automatic Speech Recognition Speech Sounds 12 Examples of Fricative Voicing Contrast 16 Time (seconds) 0.3 0.4 0.0 0.1 0.2 Zero Crossing Rate 0.5 0.6 0.7 kHz 8 0 16 16 0.5 0.6 0.7 16 16 0.7 16 16 0 0 0 0 0 dB dB dB dB dB dB dB 8 8 8 7 7 7 7 6 6 6 5 5 5 Wide Band Spectrogram kHz 4 4 kHz 0.3 Time (seconds) 0.4 0.5 0.6 0.7 0.8 16 8 kHz 0 Total Energy dB dB dB dB Energy -- 125 Hz to 750 Hz 8 0.0 0.1 0.2 Zero Crossing Rate 0 Total Energy dB Energy -- 125 Hz to 750 Hz kHz 4 0.6 8 kHz kHz 8 Energy -- 125 Hz to 750 Hz Wide Band Spectrogram 0.5 8 kHz kHz 8 Total Energy dB Time (seconds) 0.3 0.4 0.0 0.1 0.2 Zero Crossing Rate 8 kHz kHz 8 Total Energy dB 8 Time (seconds) 0.3 0.4 0.0 0.1 0.2 Zero Crossing Rate dB Energy -- 125 Hz to 750 Hz dB 8 8 7 7 7 7 6 6 6 6 6 5 5 5 5 5 Wide Band Spectrogram kHz 4 4 kHz 8 Wide Band Spectrogram kHz 4 4 kHz 4 kHz 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 Waveform 0.0 0.1 Waveform 0.2 0.3 0.4 0.5 0.6 0.7 0.0 0.1 sue /su/ 6.345 Automatic Speech Recognition Waveform 0.2 0.3 0.4 zoo /zu/ 0.5 0.6 0.7 0.0 0.1 0 Waveform 0.2 0.3 0.4 face /fe¤ s/ 0.5 0.6 0.7 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 phase /fe¤ z/ Speech Sounds 13 0.8 Rob 's Friendly Little Consonant Chart "Somewhat more accurate, yet somewhat less useful." The Semi-vowels: y Stop Fricative Nasal Manner of Articulation Place of Articulation Labial Dental pb Alveolar Palatal td Velar kg is like an extreme i w is like an extreme u l is like an extreme o r is like an extreme 5 The Odds and Ends: fv TD Weak (Non-strident) sz SZ h (unvoiced h) H (voiced h) Strong (Strident) F (flap) m n Voicing: Unvoiced Voiced 4 FÊ (nasalized flap) ? (glottal stop) The Affricates: C is like t+S J is like d+Z 6.345 Automatic Speech Recognition Speech Sounds 14 What is this word? 0.0 0.1 0.2 16 Zero Crossing Rate kHz 8 0.3 0.4 Time (seconds) 0.5 0.6 0.7 0.8 0.9 1.0 1.1 16 8 kHz 0 0 Total Energy dB dB Energy -- 125 Hz to 750 Hz dB 8 dB 8 Wide Band Spectrogram 7 7 6 6 5 5 kHz 4 4 kHz 3 3 2 2 1 1 0 0 Waveform 0.0 0.1 6.345 Automatic Speech Recognition 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 Speech Sounds 15 Stop Production • Complete closure in the vocal tract, pressure build up • Sudden release of the constriction, turbulence noise • Can have periodic excitation during closure [b] 6.345 Automatic Speech Recognition [d] [g] Speech Sounds 16 Stops of American English • There are 6 stop consonants in American English • Three places of articulation: Labial, Alveolar, and Velar • Each place of articulation has a voiced and unvoiced stop Type Labial Alveolar Velar Voiced /b/ b bought /d/ d dot /g/ g got Unvoiced /p/ p pot /t/ t tot /k/ k cot • Unvoiced stops are typically aspirated • Voiced stops usually exhibit a “voice-bar’’ during closure • Information about formant transitions and release useful for classification 6.345 Automatic Speech Recognition Speech Sounds 17 Spectrograms of Unvoiced Stops 16 Time (seconds) 0.0 0.1 0.2 0.3 Zero Crossing Rate 0.4 kHz 8 16 8 kHz 0 0 16 0.4 kHz 8 16 8 kHz 0 Total Energy 0 16 dB dB dB dB Energy -- 125 Hz to 750 Hz 8 8 7 7 6 5 Wide Band Spectrogram kHz 4 0.5 0.6 0.7 16 8 kHz 0 0 Total Energy dB dB dB dB dB Energy -- 125 Hz to 750 Hz Energy -- 125 Hz to 750 Hz dB Time (seconds) 0.3 0.4 0.0 0.1 0.2 Zero Crossing Rate kHz 8 Total Energy dB 8 Time (seconds) 0.0 0.1 0.2 0.3 Zero Crossing Rate dB 8 8 7 7 7 7 6 6 6 6 6 5 5 5 5 5 4 kHz Wide Band Spectrogram kHz 4 4 kHz 8 Wide Band Spectrogram kHz 4 4 kHz 3 3 3 3 3 3 2 2 2 2 2 2 1 1 1 1 1 1 0 0 0 0 0 Waveform 0.0 0.1 Waveform 0.2 0.3 0.4 poop /pup/ 6.345 Automatic Speech Recognition 0.0 0.1 0 Waveform 0.2 0.3 toot /tut/ 0.4 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 kook /kuk/ Speech Sounds 18 Examples of Stop Voicing Contrast 16 Time (seconds) 0.3 0.4 0.0 0.1 0.2 Zero Crossing Rate 0.5 0.6 0.7 kHz 8 16 8 kHz 0 0 16 0.5 0.6 0.7 kHz 8 16 8 kHz 0 Total Energy 0 Total Energy dB dB dB dB dB Energy -- 125 Hz to 750 Hz dB Energy -- 125 Hz to 750 Hz dB 8 Time (seconds) 0.3 0.4 0.0 0.1 0.2 Zero Crossing Rate dB 8 8 7 7 7 7 6 6 6 6 5 5 5 5 Wide Band Spectrogram kHz 4 4 kHz 8 Wide Band Spectrogram kHz 4 4 kHz 3 3 3 3 2 2 2 2 1 1 1 1 0 0 0 Waveform 0.0 0.1 0 Waveform 0.2 0.3 0.4 pop /pap/ 6.345 Automatic Speech Recognition 0.5 0.6 0.7 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 bob /bab/ Speech Sounds 19 Singleton Stop Durations VOT Duration (ms) 80 70 60 50 40 30 20 10 0 b d g p t k Voice onset times (VOTs) are longer for unvoiced stops. 6.345 Automatic Speech Recognition Speech Sounds 20 Voicing Cues for Stops There are many voicing cues for a stop. 6.345 Automatic Speech Recognition Speech Sounds 21 /s/-Stop Durations VOT Duration (ms) 80 70 60 50 40 30 20 10 0 p t k Unvoiced stops are unaspirated in /s/ stop sequences. 6.345 Automatic Speech Recognition Speech Sounds 22 Examples of Front and Back Velars 16 0.0 0.1 0.2 Zero Crossing Rate Time (seconds) 0.3 0.4 0.5 0.6 0.7 kHz 8 16 8 kHz 0 0 16 Time (seconds) 0.3 0.4 0.5 0.6 0.7 kHz 8 16 8 kHz 0 Total Energy 0 Total Energy dB dB dB dB dB Energy -- 125 Hz to 750 Hz dB Energy -- 125 Hz to 750 Hz dB 8 0.0 0.1 0.2 Zero Crossing Rate dB 8 8 7 7 7 7 6 6 6 6 5 5 5 5 Wide Band Spectrogram kHz 4 4 kHz 8 Wide Band Spectrogram kHz 4 4 kHz 3 3 3 3 2 2 2 2 1 1 1 1 0 0 0 Waveform 0.0 0.1 0 Waveform 0.2 0.3 0.4 keep /ki¤ p/ 6.345 Automatic Speech Recognition 0.5 0.6 0.7 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 cot /kOt/ Speech Sounds 23 What is this word? 16 0.0 0.1 0.2 Zero Crossing Rate 0.3 Time (seconds) 0.4 0.5 0.6 0.7 0.8 kHz 8 16 8 kHz 0 0 Total Energy dB dB Energy -- 125 Hz to 750 Hz dB 8 dB 8 Wide Band Spectrogram 7 7 6 6 5 5 kHz 4 4 kHz 3 3 2 2 1 1 0 0 Waveform 0.0 6.345 Automatic Speech Recognition 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Speech Sounds 24 Nasal Production • Velum lowering results in airflow through nasal cavity • Consonants produced with closure in oral cavity • Nasal murmurs have similar spectral characteristics [m] 6.345 Automatic Speech Recognition [n] [4] Speech Sounds 25 Nasal of American English • Three places of articulation: Labial, Alveolar, and Velar Type Labial Alveolar Velar Nasal /m/ m me /n/ n knee /4/ ng sing • Nasal consonants are always attached to a vowel, though can form an entire syllable in unstressed environments ([nÍ], [mÍ], [4Í]) • /4/ is always post-vocalic in English • Place identified by neighboring formant transitions 6.345 Automatic Speech Recognition Speech Sounds 26 Spectrograms of Nasals 16 0.0 0.1 0.2 Zero Crossing Rate 0.3 Time (seconds) 0.4 0.5 0.6 0.7 0.8 kHz 8 0 16 16 0.3 Time (seconds) 0.4 0.5 0.6 0.7 0.8 16 16 8 kHz kHz 8 8 kHz kHz 8 0 0 0 Total Energy dB dB dB dB Energy -- 125 Hz to 750 Hz 8 8 7 7 6 5 Wide Band Spectrogram kHz 4 0.5 0.6 0.7 16 8 kHz 0 Total Energy dB dB dB dB Energy -- 125 Hz to 750 Hz dB Time (seconds) 0.3 0.4 0.0 0.1 0.2 Zero Crossing Rate 0 Total Energy dB 8 0.0 0.1 0.2 Zero Crossing Rate dB Energy -- 125 Hz to 750 Hz dB 8 8 7 7 7 7 6 6 6 6 6 5 5 5 5 5 Wide Band Spectrogram 4 kHz kHz 4 8 Wide Band Spectrogram 4 kHz kHz 4 4 kHz 3 3 3 3 3 3 2 2 2 2 2 2 1 1 1 1 1 1 0 0 0 0 0 Waveform 0.0 0.1 Waveform 0.2 0.3 0.4 0.5 0.6 0.7 simmer /sIm5/ 6.345 Automatic Speech Recognition 0.8 0.0 0.1 0 Waveform 0.2 0.3 0.4 0.5 sinner /sIn5/ 0.6 0.7 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 singer /sI45/ Speech Sounds 27 What is this word? 16 0.0 0.1 0.2 Zero Crossing Rate 0.3 Time (seconds) 0.4 0.5 0.6 0.7 0.8 kHz 8 16 8 kHz 0 0 Total Energy dB dB Energy -- 125 Hz to 750 Hz dB 8 dB 8 Wide Band Spectrogram 7 7 6 6 5 5 kHz 4 4 kHz 3 3 2 2 1 1 0 0 Waveform 0.0 6.345 Automatic Speech Recognition 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Speech Sounds 28 Semivowel Production • Constriction in vocal tract, no turbulence • Slower articulatory motion than other consonants • Laterals form complete closure with tongue tip, airflow via sides of constriction [w] 6.345 Automatic Speech Recognition [y] [r] [l] Speech Sounds 29 Semivowels of American English • There are 4 semivowels in American English • Sometimes referred to as Liquids or Glides Type Glides Liquids Semivowel /w/ w wet /y/ y yet /r/ r red /l/ l let Nearest Vowel /u/ /i/ /5/ /o/ • Glides are a more extreme articulation of a corresponding vowel – Similar, though more extreme, formant positions – Generally weaker due to narrower constriction • Semivowels are always attached to a vowel, though /l/ can form an entire syllable in unstressed environments ([lÍ ]) 6.345 Automatic Speech Recognition Speech Sounds 30 Spectrograms of Semivowels 16 Time (seconds) 0.3 0.4 0.0 0.1 0.2 Zero Crossing Rate 0.5 0.6 0.7 kHz 8 0 16 16 0.5 0.6 0.7 16 16 0.7 16 16 0 0 0 0 0 dB dB dB dB dB dB dB 8 8 8 7 7 7 7 6 6 6 5 5 5 Wide Band Spectrogram kHz 4 4 kHz Time (seconds) 0.3 0.4 0.5 0.6 0.7 16 8 kHz 0 Total Energy dB dB dB dB Energy -- 125 Hz to 750 Hz 8 0.0 0.1 0.2 Zero Crossing Rate 0 Total Energy dB Energy -- 125 Hz to 750 Hz kHz 4 0.6 8 kHz kHz 8 Energy -- 125 Hz to 750 Hz Wide Band Spectrogram 0.5 8 kHz kHz 8 Total Energy dB Time (seconds) 0.3 0.4 0.0 0.1 0.2 Zero Crossing Rate 8 kHz kHz 8 Total Energy dB 8 Time (seconds) 0.3 0.4 0.0 0.1 0.2 Zero Crossing Rate dB Energy -- 125 Hz to 750 Hz dB 8 8 7 7 7 7 6 6 6 6 6 5 5 5 5 5 Wide Band Spectrogram kHz 4 4 kHz 8 Wide Band Spectrogram kHz 4 4 kHz 4 kHz 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 Waveform 0.0 0.1 Waveform 0.2 0.3 0.4 0.5 0.6 0.7 0.0 we /wi¤ / 6.345 Automatic Speech Recognition 0.1 Waveform 0.2 0.3 0.4 ye /yi¤ / 0.5 0.6 0.7 0.0 0.1 0 Waveform 0.2 0.3 0.4 reed /ri¤ d/ 0.5 0.6 0.7 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 lee /li¤ / Speech Sounds 31 Acoustic Properties of Semivowels • /w/ and /l/ are the most confusable semivowels • /w/ is characterized by a very low F1, F2 – Typically a rapid spectral falloff above F2 • /l/ is characterized by a low F1 and F2 – Often presence of high frequency energy – Postvocalic /l/ characterized by minimal spectral discontinuity, gradual motion of formants • /y/ is characterized by very low F1, very high F2 – /y/ only occurs in a syllable onset position (i.e., pre-vocalic) • /r/ is characterized by a very low F3 – Prevocalic F3 < medial F3 < postvocalic F3 6.345 Automatic Speech Recognition Speech Sounds 32 What is this word? 16 0.0 0.1 0.2 Zero Crossing Rate 0.3 0.4 0.5 Time (seconds) 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 kHz 8 16 8 kHz 0 0 Total Energy dB dB Energy -- 125 Hz to 750 Hz dB 8 dB 8 Wide Band Spectrogram 7 7 6 6 5 5 kHz 4 4 kHz 3 3 2 2 1 1 0 0 Waveform 0.0 0.1 0.2 6.345 Automatic Speech Recognition 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 Speech Sounds 33 Affricate Production • There are two affricates in American English: Voiced /J/ jh judge Unvoiced /C/ ch church • Alveolar-stop palatal-fricative pairs • Sudden release of the constriction, turbulence noise • Can have periodic excitation during closure 6.345 Automatic Speech Recognition Speech Sounds 34 Aspirant Production • There is one aspirant in American English: /h/ (e.g., “hat’’) • Produced by generating turbulence excitation at glottis • No constriction in the vocal tract, normal formant excitation • Sub-glottal coupling results in little energy in F1 region • Periodic excitation can be present in medial position 6.345 Automatic Speech Recognition Speech Sounds 35 Spectrograms of Affricates and Aspirant 16 0.0 0.1 0.2 Zero Crossing Rate Time (seconds) 0.3 0.4 0.5 0.6 0.7 kHz 8 16 8 kHz 0 0 16 0.3 Time (seconds) 0.4 0.5 0.6 0.7 0.8 kHz 8 16 8 kHz 0 Total Energy 0 Total Energy dB dB dB dB dB Energy -- 125 Hz to 750 Hz dB Energy -- 125 Hz to 750 Hz dB 8 0.0 0.1 0.2 Zero Crossing Rate dB 8 8 7 7 7 7 6 6 6 6 5 5 5 5 Wide Band Spectrogram kHz 4 4 kHz 8 Wide Band Spectrogram kHz 4 4 kHz 3 3 3 3 2 2 2 2 1 1 1 1 0 0 0 Waveform 0.0 0.1 0 Waveform 0.2 0.3 0.4 each /i¤ C/ 6.345 Automatic Speech Recognition 0.5 0.6 0.7 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 huge /hyuJ/ Speech Sounds 36 What is this word? 16 0.0 0.1 0.2 Zero Crossing Rate 0.3 Time (seconds) 0.4 0.5 0.6 0.7 0.8 kHz 8 16 8 kHz 0 0 Total Energy dB dB Energy -- 125 Hz to 750 Hz dB 8 dB 8 Wide Band Spectrogram 7 7 6 6 5 5 kHz 4 4 kHz 3 3 2 2 1 1 0 0 Waveform 0.0 6.345 Automatic Speech Recognition 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Speech Sounds 37 Phonotactic Constraints • Phonotactics is the study of allowable sound sequences • Analyses of word-initial and -final clusters reveal: – 73 distinct initial clusters (about 10 “foreign’’ clusters) – 208 distinct final clusters • Can be used to eliminate impossible phoneme sequences: – /tk/ can’t end a word, and – /kt/ can’t begin a word, – Therefore, */. . . t k t . . ./ is an impossible sequence 6.345 Automatic Speech Recognition Speech Sounds 38 Word-Initial Consonants from MWP Dictionary b bl br by C d dr dw f fl fr fy g gl gr gw h hw of be black bring beauty child do drive dwell for floor from few good glass great guava he which hy J k kl kr kw ky l m mw my n p pl pr pw py r s 6.345 Automatic Speech Recognition human just can class cross quite curious like more moire music not people place price pueblo pure right so sf sk skl skr skw sky sl sm sn sp spl spr spy st str sw S Sr t sphere school sclerosis screen square skewer slow small snake special split spring spurious state street sweet she shrewd to tr ts tw ty T Tr Tw D v vw vy w y z zl zw Z true tsunami twenty tuesday thief through thwart the very voyager view was you zero zloty zweiback genre Speech Sounds 39 The Syllable • Syllable structure captures many useful generalizations – Phoneme realization often depends on syllabification – Many phonological rules depend on syllable structure • Syllable structure is predicated on the notion of ranking the speech sounds in terms of their sonority values Sounds Low Vowels Mid Vowels High Vowels Flaps Laterals Nasals Voiced Fricatives Unvoiced Fricatives Voiced Stops Unvoiced Stops 6.345 Automatic Speech Recognition Sonority Values 10 9 8 7 6 5 4 3 2 1 Examples /a, O/ /e, o/ /i, u/ /r/ /l/ /m, n, 4/ /v, D, z/ /f, T, s/ /b, d, g/ /p, t, k/ Speech Sounds 40 Syllables and Sonority • Utterances can be divided into syllables • The number of syllables equals the number of sonority peaks • Within any syllable, there is a segment constituting a sonority peak that is preceded and/or followed by a sequence of segments with progressively decreasing sonority values s 3 u 8 p 1 r 7 ^ 9 m 5 I 8 n 5 I 8 6.345 Automatic Speech Recognition suprasegmental s E g m 3 9 2 5 minimization m a¤ z e 5 10 4 9 fire f a¤ } 3 10 (8) 9 E 9 n 5 t 1 S 3 { 9 n 5 { 9 l 6 Speech Sounds 41 The Syllable Template Onset Nucleus Coda Affix Rhyme • Branches marked by ◦ are optional • Nucleus must contain a non-obstruent • Sonority decreases away from nucleus • Affix contains only coronals: /s, z, t, d, T, D, C, J/ • Only the last syllable in a word can have an affix • /sp/, /st/, and /sk/ are treated as single obstruents 6.345 Automatic Speech Recognition Speech Sounds 42 Some Examples crown fledged links dwarves stick sixths 6.345 Automatic Speech Recognition Outer Onset Inner Onset k f l d st s r l w Nucleus Inner Coda Outer Coda a E I a I I w n J k v k k 4 r Affix 1 Affix 2 Affix 3 T s d s z s Speech Sounds 43 Words Containing /r/ and /l/ Outer Onset rock crock curt cart car lick bottle kill 6.345 Automatic Speech Recognition r k k k k l b k Inner Onset r Nucleus a a 5 a a I a,lÍ I Inner Coda r Outer Coda Affix 1 Affix 2 Affix 3 k k t t r k t l Speech Sounds 44 Acoustic Realizations of /r/ rock /rak/ 6.345 Automatic Speech Recognition curt /k5t/ car /kar/ Speech Sounds 45 Acoustic Realizations of /l/ lick /lIk/ 6.345 Automatic Speech Recognition bottle /batlÍ / kill /kIl/ Speech Sounds 46 Allophonic Variations at Syllable Boundaries nitrate /na¤ tre¤ t/ 6.345 Automatic Speech Recognition night rate /na¤ t re¤ t/ Speech Sounds 47 Assignment 2 6.345 Automatic Speech Recognition Speech Sounds 48