AI-82Ana-03 - Grupo de Tecnología del Habla

advertisement
Analysis of Parameter Importance in Speaker Identity
J.M. Gutiérrez-Arriola*, J.M. Montero**, R. Córdoba**, J.M. Pardo**
*Departamento de Ingeniería de Circuitos y Sistemas
**Grupo de Tecnología del Habla, Departamento de Ingeniería Electrónica
Universidad Politécnica de Madrid, Spain
jmga@ics.upm.es
Abstract
1. Introduction
Relevant features that characterize speaker identity can be
classified as physiologic and socio-psychological. Anatomic
properties of phoning organs are responsible for fundamental
frequency and spectral characteristics of speech. While our
environment, education or mood characterize accent, F0 curve,
word duration, rhythm, pausing, energy level,…
History about speaker identity investigation is long. It comes
from the first psychological and phonetic studies, that tried to
find the relation between acoustic parameters and speaker
characteristics such as: age, sex, height, …, to the more recent
studies that try to relate acoustic parameters to speaker identity
information [1], [2], [3].
Parameters that describe speaker identity are:
o Vocal source:

Average fundamental frequency, F0.

F0 curve.

F0 fluctuation.

Energy fluctuation.

Glottal curve.
o Vocal tract:

Vocal tract length.

Spectral envelope and spectral inclination.

Relative
differences
between
formants
amplitudes.

Formant frequencies.

Formant trajectories.

Formant bandwidths.
We will not assume that a single acoustic parameter contains
all speaker identity information. We think that voice quality is
represented by several parameters and its order of relevance
can vary from speaker to speaker and from task to task.
We will try to asses the relevance of the parameters we are
using in a speaker transformation system [6].
2. Speech analysis
We have selected eleven speakers from EUROM1 database in
Castilian Spanish. They say the sentence: “Mi abuelo me
animó a estudiar solfeo”. Sampling frequency is 16kHz.
Sentences have been analyzed to extract the following
parameters:

F0.

Formant frequencies F1, F2, F3, F4, F5 and
F6.

Formant bandwidth.

LF simplified glotal source parameters.
2.1. Fundamental frequency
Glottal closure instants are calculated using an algorithm
based on Hilbert transform. Results are manually revised to
fix possible errors.
2.2. Glottal source parameters
We have chosen Liljencrants-Fant model as glottal source
mathematical model. This model describes glottal flow
derivative as follows:
E o e αt sin ω g t
0  t  te
dU g t 

 Et     E e ε  t  t  ε  t  t 
e
dt
e c e
t e  t  t c  To
 εt e
 a
 


Eo
Emax
amplitude
We present the analysis performed over eleven speakers (five
women and 6 men) in order to obtain the most important
parameters as far as speaker identity is concerned. Parameters
that have been studied are F0, six formants, five bandwidths
and four source parameters. Feature selection is based on
linear discriminant analysis. Results show that the most
relevant parameter is F0, followed by formant frequencies and
open quotient.
0
-Ee
0
ta
tp
te tc
time
To
Figure 1. LF model
Temporal parameters (tp,te,tc,ta,T0) have been transformed
into four parameters:
1. Open quotient. It relates the time glottis remains
opened with the duration of the glottal period.
tc
100
T0
Velocity coefficient. It is the time glottis is opening
over the time glottis is closing.
tp
VELO 
tc  t p
KOPEN 
2.
3.
Glottal closure skewness. It is the relation between
the maximum glottal flow and the maximum
excitation instant in its derivative. Low values reflect
abrupt glottal closure and high values reflect slower
glottal closure.
te  t p
SKEW 
tp
0
tini
tp
te
tc
Ee
Figure 4. Extraction of LF model parameters.
4.
Return coefficient. It reflects the speed of the glottal
flow from the point of maximum excitation to the
glottal closure instant.
1
RET 
2t a
To extract glottal source parameters we use the system
described in figure 2 .
LPC
Analysis
r(t)
residue
Integrator
i(t)
residue
integral
Calculate source
parameters
Polinomic
aproximation
ia(t),
aproximate
integral
tp, te, tc, ta, Ee, Eo
Figure 5. Integral, polinomic approximation and LF
approximation.
Figure 2. System to extract glottal source parameters.
The integral of the residue of the LPC analysis is
approximated by a six order polinomial period by period
[4],[5]. The polinomic coefficients are calculated with a curve
fitting algorithm. Results are shown in figure 3
.
2.3. Vocal tract parameters
We calculate six formant frequencies and five bandwidths to
describe the vocal tract. To extract these parameters we
analyzed the energy distribution of the spectral envelope.
We calculate the roots of the LP coefficients and select
the roots that contain more energy. We automatically correct
formant trajectories to avoid mistakes during voiced parts of
speech. Finally, formant frequencies and bandwidths are
manually revised to fix errors.
3. Linear Discriminant Analysis
Figure 3. Residue integral and its polinomic
approximation
It is interesting to use the approximation instead of the
integral itself when the stochastic component is high or the
integral does not adjust to the model. The polinomic
approximation reflects low frequency components and it is
more suitable to calculate LF model parameters.
To extract LF parameters we search for significant points
in the integral approximation as shown in figure 4.
Discriminant functions are linear combinations of variables
that best separate groups. These functions can be used to rank
the variables in terms of their relative contribution to group
separation [7].
The aim of linear discriminant analysis is to obtain N
functions (where N is the number of variables) that transform
the original y variables in the new z variables that best
separate the K groups.
zi  ai ' y  ai1 y1  ai 2 y2  ...  aiN y N
i  1,2,..., N
To offset differing scales among the variables, the
discriminant function coefficients can be standardized using:
y  yN
y y
y  y2
zi  ai1* 1 1  ai 2* 2
 ...  aiN * N
s1
s2
sN
Where y is the variable mean and s is the standard
deviation of the variable. The standardized variables are scale
free. The standardized coefficients indicate the influence of
each variable in the presence of the others. The absolute
values of the coefficients ar* can be used to rank the variables
in order of their contribution to separating groups.
To calculate the coefficients we use Fisher linear
discriminant analysis and then we rank the coefficients of the
first discriminant function.
4. Results
The sentence “Mi abuelo me animó a estudiar solfeo” uttered
by eleven speakers has been analyzed. Analysis results are
shown in the following figures.
0
500
1000
1500
2000
2500
Figure 8. Return coefficient of the eleven speakers.
20
40
60
80
100
Figure 6. Open quotient of the eleven speakers.
0
1
2
3
4
Figure 9. Velocity coefficient of the eleven speakers.
0
0.1
0.2
0.3
0.4
Figure 7. SKEW of the eleven speakers
0.5
0.6
In the previous figures the right and left lines of the
‘boxes’ are the 25th and 75th percentiles of the sample. The
distance between them is the interquartile range. The line in
the middle of the box is the sample median
.
Linear discriminant analysis is applied to all the
combinations of speakers taken from 2 by 2 to all together.
When using the parameters for the whole utterance linear
discriminant does not work. It only separates males of females
taking into account F0 differences.
We performed the same analysis phone by phone,
considering only the voiced sounds. All the speakers for all
combinations were successfully classified. We consider the
first discriminant function to rank the variables and count
how many times the variable is the most relevant and how
many times it is the least relevant. Results are shown in figure
10.
more relevant
more relevant
15000
200
10000
100
5000
0
F0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5 F6OQRESKVE
0
less relevant
F0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5 F6OQRESKVE
less relevant
10000
100
5000
0
50
F0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5 F6OQRESKVE
Figure 10. Results of linear discriminant analysis.
We can see that the most relevant parameter is F0
followed by formants and OQ. The least significant parameter
is the velocity coefficient.
If we use only the vowels the results are quite similar as
we can see in figure 11.
more relevant
10000
5000
0
F0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5 F6OQRESKVE
Figure 13. Results for male speakers.
5. Conclusions
We have tried to show that the parameters we have considered
in a speaker conversion task are relevant for speaker identity.
Our results show that the order of importance depends on
the type of speech we are considering, the gender of the
speaker and the type of phones.
Results show that F0 , formant frequencies and OQ are
the most important parameters for speaker classification.
6. References
0
F0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5 F6OQRESKVE
less relevant
8000
6000
4000
2000
0
F0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5 F6OQRESKVE
Figure 11. Results using only vowels.
When considering only female speakers results change
significantly. We can see in figure 12 that the most relevant
parameter is F5 followed by B2, F2 and OQ. Results obtained
for male speakers are shown in figure 13.
more relevant
100
50
0
F0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5 F6OQRESKVE
less relevant
100
50
0
F0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5 F6OQRESKVE
Figure 12. Results for female speakers.
[1] Schwartz,M.F., Rine,H.E. “Identification of speaker sex
from isolated, whispered vowels”. J. Acoust. Soc. Amer.,
vol. 44, pp 1736-1737. 1968.
[2] Childers,D.G., Lee,C.K. “Vocal quality factors; Analysis,
synthesis and perception”. J. Acoust. Soc. Amer., vol.90
(5), pp 2394-2410. 1991.
[3] Bachorowski,J.A., Owren,M.J. “Acoustic correlates of
talker sex and individual talker identity are present in a
short vowel segment produced in running speech”. J.
Acoust. Soc. Amer., vol 106 (2), pp 1054-1063. 1999.
[4] Childers, D.G., Hu, H.T. “Speech Synthesis by glottal
excited linear prediction”. J. Acoust. Soc. Amer., vol.96,
pp 2026-2036. 1994
[5] Gutiérrez-Arriola, J.M., Giménez de los Galanes, F.,
Savoji, M.H., Pardo, J.M. “Speech synthesis and prosody
modification using segmentation and modelling of the
excitation signal”. Proc. Eurospeech'97, vol. 2, pp 10591062. Rodas, 1997
[6] Gutiérrez-Arriola, J.M., Hsiao, Y.S., Montero, J.M.,
Pardo, J.M., Childers, D.G. “Voice Conversion Based on
Parameter Transformation”. Proc. ICSLP'98, vol. 3, pp
987-990. Sidney, 1998
[7] Rencher, A.C. “Methods of multivariate analysis”. Ed.
John Wiley & Sons. 1995
Download