Lecture # 2 Session 2003 Acoustic Theory of Speech Production • Overview • Sound sources • Vocal tract transfer function – Wave equations – Sound propagation in a uniform acoustic tube • Representing the vocal tract with simple acoustic tubes • Estimating natural frequencies from area functions • Representing the vocal tract with multiple uniform tubes 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 1 A n a t o m i ca l S t r u ct u r e s f o r S p e e ch P r o d u ct i o n 6 . 3 4 5 Autom atic Speech Recognition Acous tic T heory of Speech Production 2 Phonemes in American English PHONEME /i¤/ /I/ /e¤/ /E/ /@/ /a/ /O/ /^/ /o⁄/ /U/ /u⁄/ /5/ /a¤/ /O¤/ /a⁄/ /{/ EXAMPLE beat bit bait bet bat Bob bought but boat book boot Burt bite Boyd bout about 6.345 Automatic Speech Recognition PHONEME /s/ /S/ /f/ /T/ /z/ /Z/ /v/ /D/ /p/ /t/ /k/ /b/ /d/ /g/ EXAMPLE see she fee thief z Gigi v thee pea tea key bee Dee geese PHONEME /w/ /r/ /l/ /y/ /m/ /n/ /4/ /C/ /J/ /h/ EXAMPLE wet red let yet meet neat sing church judge heat Acoustic Theory of Speech Production 3 Places of Articulation for Speech Sounds Palato-Alveolar Alveolar Labial Dental 6.345 Automatic Speech Recognition Palatal Velar Uvular Acoustic Theory of Speech Production 4 Speech Waveform: An Example Two plus seven is less than ten 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 5 A Wideband Spectrogram Two plus seven is less than ten 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 6 Acoustic Theory of Speech Production • The acoustic characteristics of speech are usually modelled as a sequence of source, vocal tract filter, and radiation characteristics UL Pr r UG Pr (jΩ) = S(jΩ) T (jΩ) R(jΩ) • For vowel production: S(jΩ) = UG (jΩ) T (jΩ) = UL (jΩ) / UG (jΩ) R(jΩ) = Pr (jΩ) / UL (jΩ) 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 7 Sound Source: Vocal Fold Vibration Modelled as a volume velocity source at glottis, UG (jΩ) Pr ( t ) To = 1/Fo UG ( f ) t 1/f2 UG ( t ) f t Men Women Children F0 ave (Hz) 125 225 300 6.345 Automatic Speech Recognition F0 min (Hz) 80 150 200 F0 max (Hz) 200 350 500 Acoustic Theory of Speech Production 8 Sound Source: Turbulence Noise • Turbulence noise is produced at a constriction in the vocal tract – Aspiration noise is produced at glottis – Frication noise is produced above the glottis • Modelled as series pressure source at constriction, PS (jΩ) Ps ( f ) 0.2 V D V : Velocity at constriction 6.345 Automatic Speech Recognition f � D: Critical dimension = 4A √ ≈ A π Acoustic Theory of Speech Production 9 Vocal Tract Wave Equations Define: u(x, t) U (x, t) p(x, t) ρ c =⇒ =⇒ =⇒ =⇒ =⇒ particle velocity volume velocity (U = uA) sound pressure variation (P = PO + p) density of air velocity of sound • Assuming plane wave propagation (for a cross dimension � λ), and a one-dimensional wave motion, it can be shown that ∂u ∂p =ρ − ∂x ∂t ∂u 1 ∂p − = ∂x ρc 2 ∂t 1 ∂2 u ∂2 u = 2 2 2 ∂x c ∂t • Time and frequency domain solutions are of the form � 1 � x x + − −sx/c sx/c u(x, s) = − P− e u(x, t) = u (t − ) − u (t + ) P+ e c ρc � c � x x + − p(x, t) = ρc u (t − ) + u (t + ) p(x, s) = P+ e−sx/c + P− esx/c c c 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 10 Propagation of Sound in a Uniform Tube A UG x = -l x = 0 • The vocal tract transfer function of volume velocities is UL (jΩ) U (−�, jΩ) = T (jΩ) = UG (jΩ) U (0, jΩ) • Using the boundary conditions U (0, s) = UG (s) and P(−�, s) = 0 T (s) = es�/c 2 + e−s�/c T (jΩ) = 1 cos(Ω�/c) • The poles of the transfer function T (jΩ) are where cos(Ω�/c) = 0 (2πfn )� (2n − 1) = π c 2 6.345 Automatic Speech Recognition c fn = (2n−1) 4� 4� λn = (2n − 1) n = 1, 2, . . . Acoustic Theory of Speech Production 11 Propagation of Sound in a Uniform Tube (con’t) • For c = 34, 000 cm/sec, � = 17 cm, the natural frequencies (also called the formants) are at 500 Hz, 1500 Hz, 2500 Hz, . . . jΩ 20 log10 T ( j Ω ) ∞ ∞ ∞ ∞ ∞ x 40 x 20 x 0 σ x 0 1 2 3 Frequency ( kHz ) 4 5 x x • The transfer function of a tube with no side branches, excited at one end and response measured at another, only has poles • The formant frequencies will have finite bandwidth when vocal tract losses are considered (e.g., radiation, walls, viscosity, heat) • The length of the vocal tract, �, corresponds to 14 λ1 , 34 λ2 , 54 λ3 , ..., where λi is the wavelength of the i th natural frequency 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 12 Standing Wave Patterns in a Uniform Tube A uniform tube closed at one end and open at the other is often referred to as a quarter wavelength resonator x glottis lips |U(x)| SWP for F1 SWP for F2 2 3 SWP for F3 2 5 6.345 Automatic Speech Recognition 4 5 Acoustic Theory of Speech Production 13 Natural Frequencies of Simple Acoustic Tubes A z-l x = -l x = 0 Quarter wavelength resonator Ωx P(x, jΩ) = 2P+ cos c U(x, jΩ) = −j A z-l Ωx A 2P+ sin ρc c x = -l x = 0 Half-wavelength resonator Ωx P(x, jΩ) = −j2P+ sin c U(x, jΩ) = A Ωx 2P+ cos ρc c Ω� Ω� A A tan Y−� = −j cot Y−� = j ρc c ρc c 1 A A� = −j ≈ −j Ω�/c � 1 ≈ jΩ 2 = jΩCA Ω�/c � 1 Ωρ� ΩMA ρc CA = A�/ρc 2 = acoustic compliance MA = ρ�/A = acoustic mass fn = c (2n − 1) 4� n = 1, 2, . . . 6.345 Automatic Speech Recognition fn = c n n = 0, 1, 2, . . . 2� Acoustic Theory of Speech Production 14 Approximating Vocal Tract Shapes [� i� ] A1 l1 [ a� ] [� u� ] A2 l2 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 15 Estimating Natural Resonance Frequencies • Resonance frequencies occur where impedance (or admittance) function equals natural (e.g., open circuit) boundary conditions UG A1 A2 l1 UL l2 Y 1+ Y 2= 0 • For a two tube approximation it is easiest to solve for Y1 + Y2 = 0 Ω�1 A2 Ω�2 A1 tan −�j cot =0 j ρc c ρc c Ω�1 Ω�2 A2 Ω�2 Ω�1 sin sin −� cos =0 cos c c A1 c c 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 16 Decoupling Simple Tube Approximations • If A1 ��A2 , or A1 ��A2 , the tubes can be decoupled and natural frequencies of each tube can be computed independently • For the vowel /i¤/, the formant frequencies are obtained from: A1 A2 l1 fn = • At low frequencies: f = c n 2�1 �� c A2 2π A1 �1 �2 l2 plus �1/2 fn = �� = c n 2�2 1 1 2π CA1 MA2 �1/2 • This low resonance frequency is called the Helmholtz resonance 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 17 Vowel Production Example 2 2 1 cm 1 cm 2 2 9 cm 7 cm 8 cm 8 cm 9 cm + 972 2917 . . . Formant F1 F2 F3 . . Actual 789 1276 2808 . . 6.345 Automatic Speech Recognition 6 cm + 1093 . . . . Estimated 972 1093 2917 . . 268 + 1944 . . . . Formant F1 F2 F3 . . 2917 . . . . Actual 256 1905 2917 . . Estimated 268 1944 2917 . . Acoustic Theory of Speech Production 18 Example of Vowel Spectrograms 16 0.0 0.1 0.2 Zero Crossing Rate Time (seconds) 0.3 0.4 0.5 0.6 0.7 kHz 8 16 8 kHz 0 0 16 0.5 0.6 0.7 kHz 8 16 8 kHz 0 Total Energy 0 Total Energy dB dB dB dB dB Energy -- 125 Hz to 750 Hz dB Energy -- 125 Hz to 750 Hz dB 8 Time (seconds) 0.3 0.4 0.0 0.1 0.2 Zero Crossing Rate dB 8 8 7 7 7 7 6 6 6 6 5 5 5 5 Wide Band Spectrogram kHz 4 4 kHz 8 Wide Band Spectrogram kHz 4 4 kHz 3 3 3 3 2 2 2 2 1 1 1 1 0 0 0 Waveform 0.0 0.1 0 Waveform 0.2 0.3 0.4 /bit/ 6.345 Automatic Speech Recognition 0.5 0.6 0.7 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 /bat/ Acoustic Theory of Speech Production 19 Estimating Anti-Resonance Frequencies (Zeros) Zeros occur at frequencies where there is no measurable output ln UG Yn Ap Yp lp An UN Ao Ab lo lb Ac Yo lc Ps A f UL lf • For nasal consonants, zeros in UN occur where YO = ∞ • For fricatives or stop consonants, zeros in UL occur where the impedance behind source is infinite (i.e., a hard wall at source) Y1 = 0 Y 3+ Y 4= 0 • Zeros occur when measurements are made in vocal tract interior 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 20 Consonant Production Ab Ac lb Ps A f lc lf POLES + [g] [s] ZEROS + Ab 5 5 + Ac 0.2 0.5 [g] poles zeros 215 0 1750 1944 1944 2916 3888 3888 . . . . 6.345 Automatic Speech Recognition + Af 4 4 �b 9 11 �c 3 3 �f 5 2.5 [s] poles zeros 306 0 1590 1590 3180 2916 3500 3180 . . . . Acoustic Theory of Speech Production 21 Example of Consonant Spectrograms 0.0 0.1 0.2 16 Zero Crossing Rate kHz 8 Time (seconds) 0.3 0.4 0.5 0.6 0.7 16 8 kHz 0 0 0.0 0.1 0.2 16 Zero Crossing Rate kHz 8 Time (seconds) 0.4 0.5 0.6 0.7 0.8 16 8 kHz 0 Total Energy 0 Total Energy dB dB dB dB dB Energy -- 125 Hz to 750 Hz dB Energy -- 125 Hz to 750 Hz dB 8 0.3 dB 8 8 7 7 7 7 6 6 6 6 5 5 5 5 Wide Band Spectrogram kHz 4 4 kHz 8 Wide Band Spectrogram kHz 4 4 kHz 3 3 3 3 2 2 2 2 1 1 1 1 0 0 0 Waveform 0.0 0.1 0 Waveform 0.2 0.3 0.4 /ki¤ p/ 6.345 Automatic Speech Recognition 0.5 0.6 0.7 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 /si¤ / Acoustic Theory of Speech Production 22 Perturbation Theory Y� � −j A Yl A for small � Ωρ� l • Consider a uniform tube, closed at one end and open at the other l Δ x • Reducing the area of a small piece of the tube near the opening (where U is max) has the same effect as keeping the area fixed and lengthening the tube • Since lengthening the tube lowers the resonant frequencies, narrowing the tube near points where U (x) is maximum in the standing wave pattern for a given formant decreases the value of that formant 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 23 Perturbation Theory (cont’d) Y� � jΩ A Yl A� for small � 2 ρc l l Δ x • Reducing the area of a small piece of the tube near the closure (where p is max) has the same effect as keeping the area fixed and shortening the tube • Since shortening the tube will increase the values of the formants, narrowing the tube near points where p(x) is maximum in the standing wave pattern for a given formant will increase the value of that formant 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 24 Summary of Perturbation Theory Results x glottis lips x glottis lips |U(x)| + ΔF1 SWP for F1 − 1 2 (as a consequence of decreasing A) ΔF2 SWP for F2 + + − 2 3 ΔF3 SWP for F3 2 5 6.345 Automatic Speech Recognition 4 5 + − − 1 2 + 1 2 + − − Acoustic Theory of Speech Production 25 Illustration of Perturbation Theory 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 26 Illustration of Perturbation Theory The ship was torn apart on the sharp (reef) 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 27 Illustration of Perturbation Theory (The ship was torn apart on the sh)arp reef 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 28 Multi-Tube Approximation of the Vocal Tract • We can represent the vocal tract as a concatenation of N lossless tubes with constant area {Ak }�and equal length Δx = �/N • The wave propagation time through each tube is τ = A Δx 6.345 Automatic Speech Recognition Δx c = � Nc A7 Δx Δx Δx Δx Δx Δx Acoustic Theory of Speech Production 29 Wave Equations for Individual Tube The wave equations for the kth tube have the form ρc + x x pk (x, t) = [Uk (t −� ) + Uk−�(t + )] c Ak c Uk (x, t) = Uk+ (t −� xc ) −�Uk−�(t + xc ) where x is measured from the left-hand side (0 ≤�x ≤�Δx) + + U k ( t - τ ) U k+1( t ) - U k ( t + τ ) U k+1 ( t ) + Uk ( t ) - Uk ( t ) - + U k+1 ( t - τ ) U k+1 (t+τ) Δx Ak Δx A k+1 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 30 Update Expression at Tube Boundaries We can solve update expressions using continuity constraints at tube boundaries e.g., pk (Δx, t) = pk+1 (0, t), and Uk (Δx, t) = Uk+1 (0, t) + + Uk ( t) DELAY + Uk ( t - τ ) τ Uk + 1 ( t ) 1 + rk - 1 - rk DELAY τ U k( t +τ ) τ Uk+1( t - τ ) DELAY U k + 1( t + τ ) + rk - rk Uk ( t ) DELAY Uk + 1 ( k th tube t) - τ ( k + 1 ) st tube −� Uk++1 (t) = (1 + rk )Uk+ (t −�τ) + rk Uk+1 (t) −� (t) Uk− (t + τ) = −rk Uk+ (t −�τ) + (1 −�rk )Uk+1 rk = 6.345 Automatic Speech Recognition Ak+1 −�Ak Ak+1 + Ak note |�rk |≤�1 Acoustic Theory of Speech Production 31 Digital Model of Multi-Tube Vocal Tract • Updates at tube boundaries occur synchronously every 2τ • If excitation is band-limited, inputs can be sampled every T = 2τ • Each tube section has a delay of z−1/2 + Uk ( z) z 1 2 1 + rk + Uk + 1 ( z ) -rk rk - - Uk ( z ) z 1 2 Uk + 1 ( z ) 1 - rk • The choice of N depends on the sampling rate T 2� � =⇒� N = T = 2τ = 2 Nc cT • Series and shunt losses can also be introduced at tube junctions – Bandwidths are proportional to energy loss to storage ratio – Stored energy is proportional to tube length 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 32 Assignment 1 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 33 References • Zue, 6.345 Course Notes • Stevens, Acoustic Phonetics, MIT Press, 1998. • Rabiner & Schafer, Digital Processing of Speech Signals, Prentice-Hall, 1978. 6.345 Automatic Speech Recognition Acoustic Theory of Speech Production 34