Acoustic • – Lecture # 2

advertisement
Lecture # 2
Session 2003
Acoustic Theory of Speech Production
• Overview
• Sound sources
• Vocal tract transfer function
– Wave equations
– Sound propagation in a uniform acoustic tube
• Representing the vocal tract with simple acoustic tubes
• Estimating natural frequencies from area functions
• Representing the vocal tract with multiple uniform tubes
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 1
A n a t o m i ca l S t r u ct u r e s f o r S p e e ch P r o d u ct i o n
6 . 3 4 5 Autom atic Speech Recognition
Acous tic T heory of Speech Production 2
Phonemes in American English
PHONEME
/i¤/
/I/
/e¤/
/E/
/@/
/a/
/O/
/^/
/o⁄/
/U/
/u⁄/
/5/
/a¤/
/O¤/
/a⁄/
/{/
EXAMPLE
beat
bit
bait
bet
bat
Bob
bought
but
boat
book
boot
Burt
bite
Boyd
bout
about
6.345 Automatic Speech Recognition
PHONEME
/s/
/S/
/f/
/T/
/z/
/Z/
/v/
/D/
/p/
/t/
/k/
/b/
/d/
/g/
EXAMPLE
see
she
fee
thief
z
Gigi
v
thee
pea
tea
key
bee
Dee
geese
PHONEME
/w/
/r/
/l/
/y/
/m/
/n/
/4/
/C/
/J/
/h/
EXAMPLE
wet
red
let
yet
meet
neat
sing
church
judge
heat
Acoustic Theory of Speech Production 3
Places of Articulation for Speech Sounds
Palato-Alveolar
Alveolar
Labial
Dental
6.345 Automatic Speech Recognition
Palatal
Velar
Uvular
Acoustic Theory of Speech Production 4
Speech Waveform: An Example
Two plus seven is less than ten
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 5
A Wideband Spectrogram
Two plus seven is less than ten
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 6
Acoustic Theory of Speech Production
• The acoustic characteristics of speech are usually modelled as a
sequence of source, vocal tract filter, and radiation characteristics
UL
Pr
r
UG
Pr (jΩ) = S(jΩ) T (jΩ) R(jΩ)
• For vowel production:
S(jΩ) = UG (jΩ)
T (jΩ) = UL (jΩ) / UG (jΩ)
R(jΩ) = Pr (jΩ) / UL (jΩ)
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 7
Sound Source: Vocal Fold Vibration
Modelled as a volume velocity source at glottis, UG (jΩ)
Pr ( t )
To = 1/Fo
UG ( f )
t
1/f2
UG ( t )
f
t
Men
Women
Children
F0 ave (Hz)
125
225
300
6.345 Automatic Speech Recognition
F0 min (Hz)
80
150
200
F0 max (Hz)
200
350
500
Acoustic Theory of Speech Production 8
Sound Source: Turbulence Noise
• Turbulence noise is produced at a constriction in the vocal tract
– Aspiration noise is produced at glottis
– Frication noise is produced above the glottis
• Modelled as series pressure source at constriction, PS (jΩ)
Ps ( f )
0.2 V
D
V : Velocity at constriction
6.345 Automatic Speech Recognition
f
�
D: Critical dimension =
4A √
≈ A
π
Acoustic Theory of Speech Production 9
Vocal Tract Wave Equations
Define:
u(x, t)
U (x, t)
p(x, t)
ρ
c
=⇒
=⇒
=⇒
=⇒
=⇒
particle velocity
volume velocity (U = uA)
sound pressure variation (P = PO + p)
density of air
velocity of sound
• Assuming plane wave propagation (for a cross dimension � λ),
and a one-dimensional wave motion, it can be shown that
∂u
∂p
=ρ
−
∂x
∂t
∂u
1 ∂p
−
=
∂x ρc 2 ∂t
1 ∂2 u
∂2 u
= 2 2
2
∂x
c ∂t
• Time and frequency domain solutions are of the form
�
1 �
x
x
+
−
−sx/c
sx/c
u(x, s) =
− P− e
u(x, t) = u (t − ) − u (t + )
P+ e
c
ρc
� c
�
x
x
+
−
p(x, t) = ρc u (t − ) + u (t + )
p(x, s) = P+ e−sx/c + P− esx/c
c
c
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 10
Propagation of Sound in a Uniform Tube
A
UG
x = -l
x = 0
• The vocal tract transfer function of volume velocities is
UL (jΩ) U (−�, jΩ)
=
T (jΩ) =
UG (jΩ)
U (0, jΩ)
• Using the boundary conditions U (0, s) = UG (s) and P(−�, s) = 0
T (s) =
es�/c
2
+ e−s�/c
T (jΩ) =
1
cos(Ω�/c)
• The poles of the transfer function T (jΩ) are where cos(Ω�/c) = 0
(2πfn )� (2n − 1)
=
π
c
2
6.345 Automatic Speech Recognition
c
fn =
(2n−1)
4�
4�
λn =
(2n − 1)
n = 1, 2, . . .
Acoustic Theory of Speech Production 11
Propagation of Sound in a Uniform Tube (con’t)
• For c = 34, 000 cm/sec, � = 17 cm, the natural frequencies (also
called the formants) are at 500 Hz, 1500 Hz, 2500 Hz, . . .
jΩ
20 log10 T ( j Ω
)
∞
∞
∞
∞
∞
x
40
x
20
x
0
σ
x
0
1
2
3
Frequency ( kHz )
4
5
x
x
• The transfer function of a tube with no side branches, excited at
one end and response measured at another, only has poles
• The formant frequencies will have finite bandwidth when vocal
tract losses are considered (e.g., radiation, walls, viscosity, heat)
• The length of the vocal tract, �, corresponds to 14 λ1 , 34 λ2 , 54 λ3 , ...,
where λi is the wavelength of the i th natural frequency
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 12
Standing Wave Patterns in a Uniform Tube
A uniform tube closed at one end and open at the other is often
referred to as a quarter wavelength resonator
x
glottis
lips
|U(x)|
SWP for
F1
SWP for
F2
2
3
SWP for
F3
2
5
6.345 Automatic Speech Recognition
4
5
Acoustic Theory of Speech Production 13
Natural Frequencies of Simple Acoustic Tubes
A
z-l
x = -l
x = 0
Quarter wavelength resonator
Ωx
P(x, jΩ) = 2P+ cos
c
U(x, jΩ) = −j
A
z-l
Ωx
A
2P+ sin
ρc
c
x = -l
x = 0
Half-wavelength resonator
Ωx
P(x, jΩ) = −j2P+ sin
c
U(x, jΩ) =
A
Ωx
2P+ cos
ρc
c
Ω�
Ω�
A
A
tan
Y−� = −j
cot
Y−� = j
ρc
c
ρc
c
1
A
A�
= −j
≈ −j
Ω�/c � 1
≈ jΩ 2 = jΩCA Ω�/c � 1
Ωρ�
ΩMA
ρc
CA = A�/ρc 2 = acoustic compliance MA = ρ�/A = acoustic mass
fn =
c
(2n − 1)
4�
n = 1, 2, . . .
6.345 Automatic Speech Recognition
fn =
c
n n = 0, 1, 2, . . .
2�
Acoustic Theory of Speech Production 14
Approximating Vocal Tract Shapes
[� i� ]
A1
l1
[ a� ]
[� u� ] A2
l2
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 15
Estimating Natural Resonance Frequencies
• Resonance frequencies occur where impedance (or admittance)
function equals natural (e.g., open circuit) boundary conditions
UG
A1
A2
l1
UL
l2
Y 1+ Y 2= 0
• For a two tube approximation it is easiest to solve for Y1 + Y2 = 0
Ω�1
A2
Ω�2
A1
tan
−�j
cot
=0
j
ρc
c
ρc
c
Ω�1
Ω�2 A2
Ω�2
Ω�1
sin
sin
−�
cos
=0
cos
c
c
A1
c
c
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 16
Decoupling Simple Tube Approximations
• If A1 ��A2 , or A1 ��A2 , the tubes can be decoupled and natural
frequencies of each tube can be computed independently
• For the vowel /i¤/, the formant frequencies are obtained from:
A1
A2
l1
fn =
• At low frequencies:
f =
c
n
2�1
��
c
A2
2π A1 �1 �2
l2
plus
�1/2
fn =
��
=
c
n
2�2
1
1
2π CA1 MA2
�1/2
• This low resonance frequency is called the Helmholtz resonance
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 17
Vowel Production Example
2
2
1 cm
1 cm
2
2
9 cm
7 cm
8 cm
8 cm
9 cm
+
972
2917
.
.
.
Formant
F1
F2
F3
.
.
Actual
789
1276
2808
.
.
6.345 Automatic Speech Recognition
6 cm
+
1093
.
.
.
.
Estimated
972
1093
2917
.
.
268
+
1944
.
.
.
.
Formant
F1
F2
F3
.
.
2917
.
.
.
.
Actual
256
1905
2917
.
.
Estimated
268
1944
2917
.
.
Acoustic Theory of Speech Production 18
Example of Vowel Spectrograms
16
0.0
0.1
0.2
Zero Crossing Rate
Time (seconds)
0.3
0.4
0.5
0.6
0.7
kHz 8
16
8 kHz
0
0
16
0.5
0.6
0.7
kHz 8
16
8 kHz
0
Total Energy
0
Total Energy
dB
dB
dB
dB
dB
Energy -- 125 Hz to 750 Hz
dB
Energy -- 125 Hz to 750 Hz
dB
8
Time (seconds)
0.3
0.4
0.0
0.1
0.2
Zero Crossing Rate
dB
8
8
7
7
7
7
6
6
6
6
5
5
5
5
Wide Band Spectrogram
kHz 4
4 kHz
8
Wide Band Spectrogram
kHz 4
4 kHz
3
3
3
3
2
2
2
2
1
1
1
1
0
0
0
Waveform
0.0
0.1
0
Waveform
0.2
0.3
0.4
/bit/
6.345 Automatic Speech Recognition
0.5
0.6
0.7
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
/bat/
Acoustic Theory of Speech Production 19
Estimating Anti-Resonance Frequencies (Zeros)
Zeros occur at frequencies where there is no measurable output
ln
UG
Yn
Ap
Yp
lp
An
UN
Ao
Ab
lo
lb
Ac
Yo
lc
Ps A f
UL
lf
• For nasal consonants, zeros in UN occur where YO = ∞
• For fricatives or stop consonants, zeros in UL occur where the
impedance behind source is infinite (i.e., a hard wall at source)
Y1 = 0
Y 3+ Y 4= 0
• Zeros occur when measurements are made in vocal tract interior
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 20
Consonant Production
Ab
Ac
lb
Ps A f
lc
lf
POLES
+
[g]
[s]
ZEROS
+
Ab
5
5
+
Ac
0.2
0.5
[g]
poles zeros
215
0
1750 1944
1944 2916
3888 3888
.
.
.
.
6.345 Automatic Speech Recognition
+
Af
4
4
�b
9
11
�c
3
3
�f
5
2.5
[s]
poles zeros
306
0
1590 1590
3180 2916
3500 3180
.
.
.
.
Acoustic Theory of Speech Production 21
Example of Consonant Spectrograms
0.0
0.1
0.2
16
Zero Crossing Rate
kHz 8
Time (seconds)
0.3
0.4
0.5
0.6
0.7
16
8 kHz
0
0
0.0
0.1
0.2
16
Zero Crossing Rate
kHz 8
Time (seconds)
0.4
0.5
0.6
0.7
0.8
16
8 kHz
0
Total Energy
0
Total Energy
dB
dB
dB
dB
dB
Energy -- 125 Hz to 750 Hz
dB
Energy -- 125 Hz to 750 Hz
dB
8
0.3
dB
8
8
7
7
7
7
6
6
6
6
5
5
5
5
Wide Band Spectrogram
kHz 4
4 kHz
8
Wide Band Spectrogram
kHz 4
4 kHz
3
3
3
3
2
2
2
2
1
1
1
1
0
0
0
Waveform
0.0
0.1
0
Waveform
0.2
0.3
0.4
/ki¤ p/
6.345 Automatic Speech Recognition
0.5
0.6
0.7
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
/si¤ /
Acoustic Theory of Speech Production 22
Perturbation Theory
Y� � −j
A
Yl
A
for small �
Ωρ�
l
• Consider a uniform tube, closed at one end and open at the other
l
Δ x
• Reducing the area of a small piece of the tube near the opening
(where U is max) has the same effect as keeping the area fixed
and lengthening the tube
• Since lengthening the tube lowers the resonant frequencies,
narrowing the tube near points where U (x) is maximum in the
standing wave pattern for a given formant decreases the value of
that formant
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 23
Perturbation Theory (cont’d)
Y� � jΩ
A
Yl
A�
for small �
2
ρc
l
l
Δ x
• Reducing the area of a small piece of the tube near the closure
(where p is max) has the same effect as keeping the area fixed and
shortening the tube
• Since shortening the tube will increase the values of the formants,
narrowing the tube near points where p(x) is maximum in the
standing wave pattern for a given formant will increase the value
of that formant
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 24
Summary of Perturbation Theory Results
x
glottis
lips
x
glottis
lips
|U(x)|
+
ΔF1
SWP for
F1
−
1
2
(as a consequence of decreasing A)
ΔF2
SWP for
F2
+
+
−
2
3
ΔF3
SWP for
F3
2
5
6.345 Automatic Speech Recognition
4
5
+
−
−
1
2
+
1
2
+
−
−
Acoustic Theory of Speech Production 25
Illustration of Perturbation Theory
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 26
Illustration of Perturbation Theory
The ship was torn apart on the sharp (reef)
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 27
Illustration of Perturbation Theory
(The ship was torn apart on the sh)arp reef
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 28
Multi-Tube Approximation of the Vocal Tract
• We can represent the vocal tract as a concatenation of N lossless
tubes with constant area {Ak }�and equal length Δx = �/N
• The wave propagation time through each tube is τ =
A
Δx
6.345 Automatic Speech Recognition
Δx
c
=
�
Nc
A7
Δx
Δx
Δx
Δx
Δx
Δx
Acoustic Theory of Speech Production 29
Wave Equations for Individual Tube
The wave equations for the kth tube have the form
ρc +
x
x
pk (x, t) =
[Uk (t −� ) + Uk−�(t + )]
c
Ak
c
Uk (x, t) = Uk+ (t −� xc ) −�Uk−�(t + xc )
where x is measured from the left-hand side (0 ≤�x ≤�Δx)
+
+
U k ( t - τ ) U k+1( t )
-
U k ( t + τ ) U k+1 ( t )
+
Uk ( t )
-
Uk ( t )
-
+
U k+1 ( t - τ )
U k+1
(t+τ)
Δx
Ak
Δx
A k+1
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 30
Update Expression at Tube Boundaries
We can solve update expressions using continuity constraints at
tube boundaries e.g., pk (Δx, t) = pk+1 (0, t), and Uk (Δx, t) = Uk+1 (0, t)
+
+
Uk (
t)
DELAY
+
Uk ( t - τ )
τ
Uk + 1 ( t )
1 + rk
-
1 - rk
DELAY
τ
U k(
t +τ )
τ
Uk+1( t - τ )
DELAY
U k + 1( t + τ )
+
rk
- rk
Uk ( t )
DELAY
Uk + 1 (
k th tube
t)
-
τ
( k + 1 ) st tube
−�
Uk++1 (t) = (1 + rk )Uk+ (t −�τ) + rk Uk+1
(t)
−�
(t)
Uk− (t + τ) = −rk Uk+ (t −�τ) + (1 −�rk )Uk+1
rk =
6.345 Automatic Speech Recognition
Ak+1 −�Ak
Ak+1 + Ak
note |�rk |≤�1
Acoustic Theory of Speech Production 31
Digital Model of Multi-Tube Vocal Tract
• Updates at tube boundaries occur synchronously every 2τ
• If excitation is band-limited, inputs can be sampled every T = 2τ
• Each tube section has a delay of z−1/2
+
Uk (
z)
z
1
2
1 + rk
+
Uk + 1 ( z )
-rk
rk
-
-
Uk ( z )
z
1
2
Uk + 1 ( z )
1 - rk
• The choice of N depends on the sampling rate T
2�
�
=⇒� N =
T = 2τ = 2
Nc
cT
• Series and shunt losses can also be introduced at tube junctions
– Bandwidths are proportional to energy loss to storage ratio
– Stored energy is proportional to tube length
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 32
Assignment 1
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 33
References
• Zue, 6.345 Course Notes
• Stevens, Acoustic Phonetics, MIT Press, 1998.
• Rabiner & Schafer, Digital Processing of Speech Signals,
Prentice-Hall, 1978.
6.345 Automatic Speech Recognition
Acoustic Theory of Speech Production 34
Download