Speech Processing

References
James H. McClellan, et al. Computer-Based Exercises for Signal Processing Using MATLAB 5. Prentice-Hall, 1998.
L.R. Rabiner and R.W. Schafer. Digital Processing of Speech Signals. Prentice-Hall, 1978.
Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition. Prentice-Hall, 1993.
Spoken words are divided into phonemes. European
languages have about forty phonemes. Phonemes fall
into two groups: voiced sounds and unvoiced sounds.
Voiced sounds are “vowel-like” sounds where the sound
comes from the throat. Unvoiced phonemes are
“consonant-like” phonemes where the sound comes
from compressed air blown through the mouth.
Although unvoiced phonemes are “consonant-like,” not
all consonants are unvoiced: phonemes like “s” are
unvoiced, but phonemes like “z” are voiced.
Speech production may be modeled by the following
diagram:

Voiced:    Pulse Train --> Glottis --+
                                     +--> Vocal Tract --> Lip Radiation
Unvoiced:  Random Noise -------------+

(See Figure 10.5 in Computer-Based Exercises for Signal Processing.)
The glottis (in the throat) produces “quasi-periodic”
signals (like singing a long note). These signals are
modeled as the output of the glottis block and are
then passed into the vocal tract block, which models
the mouth, nose and teeth. Finally, the lip radiation
block models the lips.
Unvoiced sounds have no glottal pulse component
and can be modeled with the vocal tract and lip
radiation blocks alone. To obtain any kind of sound,
the input to the vocal tract and lip radiation blocks
cannot simply be a unit step; it must be a random
process.
Let us give names to these signals and transfer
functions:

Voiced:    Pulse Train --e[n]--> Glottis G(z) --uG[n]--+
                                                       +--> Vocal Tract V(z) --uL[n]--> Lip Radiation R(z) --> pL[n]
Unvoiced:  Random Noise -------------------------------+

e[n] is a periodic pulse train.
G(z) is the transfer function of the glottis.
uG[n] is the glottis output.
V(z) is the transfer function of the vocal tract.
uL[n] is the output of the vocal tract.
R(z) is the transfer function of the lips.
pL[n] is the output of the lips.
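As an illustration of the two excitation signals, the following is a minimal sketch of how a periodic pulse train e[n] and a random-noise input might be generated. The sampling rate `fs`, pitch `f0` and duration `dur` are assumed values chosen for the example; none of them appears in the slides.

```matlab
% Sketch: excitation signals for the voiced and unvoiced branches.
% fs, f0 and dur are assumed example values, not taken from the slides.
fs  = 8000;                  % sampling rate in Hz (assumed)
f0  = 100;                   % pitch of the voiced excitation in Hz (assumed)
dur = 0.1;                   % duration in seconds (assumed)

nSamples = round(dur*fs);    % number of samples (800 here)
P = round(fs/f0);            % pitch period in samples (80 here)

e = zeros(1, nSamples);      % voiced excitation e[n]
e(1:P:end) = 1;              % one unit pulse every P samples

w = randn(1, nSamples);      % unvoiced excitation: white Gaussian noise
```

Changing `f0` changes the spacing of the pulses and therefore the perceived pitch of the voiced sound.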
The glottal transfer function G(z) will be represented
by an exponential model:
$$G(z) = \frac{U_G(z)}{E(z)} = \frac{(-e)\,[a \ln(a)]\, z^{-1}}{(1 - a z^{-1})^{2}}.$$
The symbol e here represents the base of natural
logarithms, not the excitation e[n]. The parameter a
is a positive value less than one that corresponds to
the natural frequency of the glottis (which varies
from speaker to speaker, man to woman, child to
adult, etc.).
The frequency response of G(z) for various values of
a is shown on the following slide. (Graph printed
using glottal.m.)
[Figure: Frequency Response. Magnitude |G(e^{jω})| (vertical axis, 0 to 25) versus ω in units of π (horizontal axis, 0 to 1), with curves for a = 0.90, a = 0.80 and a = 0.70.]
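A plot of this kind can be reproduced approximately with a few lines of MATLAB. The sketch below is not the glottal.m script mentioned above, whose contents are not shown; it evaluates G(z) on the unit circle with zero-padded FFTs, and the FFT length `Nfft` is an assumed choice.

```matlab
% Sketch: magnitude response of the glottal model for three values of a.
% This is an independent illustration, not the original glottal.m.
Nfft  = 1024;                          % FFT length (assumed)
avals = [0.70 0.80 0.90];              % the three values shown in the figure
w     = (0:Nfft/2) * 2*pi/Nfft;        % frequencies from 0 to pi

Gmag = zeros(numel(avals), numel(w));
for i = 1:numel(avals)
    a    = avals(i);
    numg = [0, -exp(1)*a*log(a)];      % (-e)[a ln(a)] z^-1
    deng = conv([1 -a], [1 -a]);       % (1 - a z^-1)^2
    H    = fft(numg, Nfft) ./ fft(deng, Nfft);   % G(e^{jw}) on the FFT grid
    Gmag(i, :) = abs(H(1:Nfft/2+1));
end

fprintf('DC gain for a = %.2f: %.2f\n', [avals; Gmag(:, 1).']);
```

Calling `plot(w/pi, Gmag)` afterwards draws the three curves against ω in units of π. The response is lowpass, and the gain at ω = 0 grows as a approaches one.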
The vocal tract V(z) can be modeled after a
sequence of “lossless tubes”:
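[Figure: a chain of lossless tube sections with cross-sectional areas Ak-1, Ak, Ak+1; uG[n] enters at one end and uL[n] leaves at the other.]
Each “tube” has a cross-sectional area Ak.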
The vocal tract transfer function V(z) will be
represented by the following model:
$$V(z) = \frac{U_L(z)}{U_G(z)} = \frac{\prod_{k=1}^{N} (1 + r_k)\; z^{-N/2}}{D(z)}.$$
The parameters rk (which correspond to reflection
coefficients along the vocal tract) are found from
$$r_k = \frac{A_{k+1} - A_k}{A_{k+1} + A_k},$$
where Ak (k = 1, …, N) are parameters corresponding
to cross-sectional areas of the vocal tract. (These
values are given for a particular phoneme.)
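As a small worked example, the reflection coefficients between adjacent tube sections can be computed in one vectorized line. The area values below are hypothetical and do not correspond to any phoneme. Note that N areas yield only N-1 coefficients from this formula, since the last one would need an area beyond the final section.

```matlab
% Sketch: reflection coefficients from a hypothetical set of tube areas.
A = [1.0 2.0 4.0 1.0];       % hypothetical cross-sectional areas A_1..A_4

% r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k) for k = 1..N-1
r = (A(2:end) - A(1:end-1)) ./ (A(2:end) + A(1:end-1));

disp(r);                     % 0.3333  0.3333  -0.6000
```

A widening tube (A_{k+1} > A_k) gives a positive coefficient and a narrowing one gives a negative coefficient; every value lies strictly between -1 and 1 as long as the areas are positive.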
The denominator D(z) is found from the recursive
relationship
$$D_k(z) = D_{k-1}(z) + r_k\, z^{-k}\, D_{k-1}(z^{-1}),$$
starting with D0(z) = 1 and ending with D(z) = DN(z).
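To see the recursion at work, the sketch below runs it for two hypothetical reflection coefficients. Polynomials in z^-1 are stored as row vectors of coefficients, so reversing the vector and shifting it by one position corresponds to forming z^-k D_{k-1}(z^-1).

```matlab
% Sketch: the D_k(z) recursion for two hypothetical reflection coefficients.
r = [0.5 -0.25];                 % hypothetical r_1, r_2
D = 1;                           % D_0(z) = 1

for k = 1:numel(r)
    % [D 0]           : D_{k-1}(z) padded to degree k
    % [0 fliplr(D)]   : z^-k * D_{k-1}(z^-1)
    D = [D 0] + r(k) * [0 fliplr(D)];
end

disp(D);                         % 1.0000  0.3750  -0.2500
```

By hand: D_1(z) = 1 + 0.5 z^-1, and D_2(z) = 1 + (0.5 - 0.125) z^-1 - 0.25 z^-2, which matches the printed coefficients.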
The numerator gain G of V(z) (a constant, not to be
confused with the glottal transfer function G(z)) is
found by
$$G = \prod_{k=1}^{N} (1 + r_k).$$
Finally, the lip radiation transfer function is given by
$$R(z) = \frac{P_L(z)}{U_L(z)} = 1 - z^{-1}.$$
The previous voice model was implemented in
MATLAB in a script file called voice.m.
The vocal tract transfer function V(z) parameters are
computed by a MATLAB function called AtoV().
The glottal transfer function G(z) coefficients are
assigned to arrays numg and deng.
The vocal tract/lip radiation transfer function
V(z)R(z) coefficients are assigned to arrays numv
and denv.
numg, deng --> G(z)
AtoV() --> numv, denv --> V(z)R(z)

Voiced:    Pulse Train --e[n]--> Glottis G(z) --uG[n]--+
                                                       +--> Vocal Tract V(z) --uL[n]--> Lip Radiation R(z) --> pL[n]
Unvoiced:  Random Noise (uG[n] = rand();) -------------+

AtoV()
$$r_k = \frac{A_{k+1} - A_k}{A_{k+1} + A_k}$$
r = [];                                   % start with an empty list of reflection coefficients
for k=1:N-1
  r = [r (A(k+1)-A(k))/(A(k+1)+A(k))];    % append r_k
end;
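$$D_k(z) = D_{k-1}(z) + r_k\, z^{-k}\, D_{k-1}(z^{-1}).$$
$$G = \prod_{k=1}^{N} (1 + r_k).$$
D = 1; G = 1;     % D_0(z) = 1; the gain product starts at 1
% This loop reads r(1..N), but the loop over k=1:N-1 fills only r(1..N-1);
% the slide does not show where r(N) is set, so it must be supplied separately.
for k=1:N
  D = [D 0] + r(k).*[0 fliplr(D)];    % D_k(z) = D_{k-1}(z) + r_k z^-k D_{k-1}(z^-1)
  G = G*(1+r(k));                     % accumulate the product of (1 + r_k)
end;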
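The array p is a pulse train.
Voiced Speech
ug = 0.1*filter(numg,deng,p);     % glottal output: pulse train through G(z), scaled
pl = filter(numv,denv,ug);        % vocal tract and lip radiation V(z)R(z)
Unvoiced Speech
ug = 0.01*randn(1,10000);         % white Gaussian noise replaces the glottal output
pl = filter(numv,denv,ug);        % same V(z)R(z) filter as the voiced case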
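Putting the pieces together, the following is a minimal end-to-end sketch of the voiced and unvoiced branches. It is not the voice.m script, whose contents are not shown here. The area values are arbitrary and do not represent any phoneme, and `fs`, `f0`, `a` and the lip reflection coefficient `rN` are all assumed. The way `numv` is assembled (gain, a delay of N/2 samples, then the first difference) is one reading of the formulas for V(z) and R(z), not a confirmed detail of the original code.

```matlab
% Sketch: end-to-end synthesis with assumed parameters (not the original voice.m).
fs = 8000;  f0 = 100;                 % sampling rate and pitch in Hz (assumed)
a  = 0.9;                             % glottal parameter (assumed)
A  = [1.6 2.6 0.65 1.6 2.6 4.0 6.5 8.0 7.0 5.0];   % arbitrary areas, N = 10
rN = 0.71;                            % assumed reflection coefficient at the lips

% Glottis G(z)
numg = [0, -exp(1)*a*log(a)];
deng = conv([1 -a], [1 -a]);

% Vocal tract: reflection coefficients, then D(z) and the gain G
N = numel(A);
r = [(A(2:end) - A(1:end-1)) ./ (A(2:end) + A(1:end-1)), rN];
D = 1;  G = 1;
for k = 1:N
    D = [D 0] + r(k) * [0 fliplr(D)];
    G = G * (1 + r(k));
end

% V(z)R(z): gain G, delay z^-(N/2), and lip radiation 1 - z^-1 (N must be even)
numv = G * conv([zeros(1, N/2) 1], [1 -1]);
denv = D;

% Voiced branch: pulse train through the glottis, then V(z)R(z)
p = zeros(1, 10000);
p(1:round(fs/f0):end) = 1;
ug = 0.1 * filter(numg, deng, p);
pl_voiced = filter(numv, denv, ug);

% Unvoiced branch: white noise straight into V(z)R(z)
ug = 0.01 * randn(1, 10000);
pl_unvoiced = filter(numv, denv, ug);
```

Because every reflection coefficient has magnitude below one, the roots of `denv` lie inside the unit circle and both outputs stay bounded. Replacing `A` with measured areas for a specific vowel would change the resonances and therefore the vowel that is heard.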
Given the vocal tract areas Ak for a particular vowel,
we can synthesize that vowel.
In the following demonstration, we will synthesize
the phonemes AA and IY.
The phoneme AA is the “ah” vowel, as in “father.”
The phoneme IY is like a long e (ē), as in “beet.”
AA voiced (aav.wav)
AA unvoiced (aau.wav)
IY voiced (iyv.wav)
IY unvoiced (iyu.wav)