AI-82Ana-03 - Grupo de Tecnología del Habla

Analysis of Parameter Importance in Speaker Identity J.M. Gutiérrez-Arriola*, J.M. Montero**, R. Córdoba**, J.M. Pardo** *Departamento de Ingeniería de Circuitos y Sistemas **Grupo de Tecnología del Habla, Departamento de Ingeniería Electrónica Universidad Politécnica de Madrid, Spain jmga@ics.upm.es Abstract 1. Introduction Relevant features that characterize speaker identity can be classified as physiologic and socio-psychological. Anatomic properties of phoning organs are responsible for fundamental frequency and spectral characteristics of speech. While our environment, education or mood characterize accent, F0 curve, word duration, rhythm, pausing, energy level,… History about speaker identity investigation is long. It comes from the first psychological and phonetic studies, that tried to find the relation between acoustic parameters and speaker characteristics such as: age, sex, height, …, to the more recent studies that try to relate acoustic parameters to speaker identity information [1], [2], [3]. Parameters that describe speaker identity are: o Vocal source:  Average fundamental frequency, F0.  F0 curve.  F0 fluctuation.  Energy fluctuation.  Glottal curve. o Vocal tract:  Vocal tract length.  Spectral envelope and spectral inclination.  Relative differences between formants amplitudes.  Formant frequencies.  Formant trajectories.  Formant bandwidths. We will not assume that a single acoustic parameter contains all speaker identity information. We think that voice quality is represented by several parameters and its order of relevance can vary from speaker to speaker and from task to task. We will try to asses the relevance of the parameters we are using in a speaker transformation system [6]. 2. Speech analysis We have selected eleven speakers from EUROM1 database in Castilian Spanish. They say the sentence: “Mi abuelo me animó a estudiar solfeo”. Sampling frequency is 16kHz. Sentences have been analyzed to extract the following parameters:  F0.  Formant frequencies F1, F2, F3, F4, F5 and F6.  Formant bandwidth.  LF simplified glotal source parameters. 2.1. Fundamental frequency Glottal closure instants are calculated using an algorithm based on Hilbert transform. Results are manually revised to fix possible errors. 2.2. Glottal source parameters We have chosen Liljencrants-Fant model as glottal source mathematical model. This model describes glottal flow derivative as follows: E o e αt sin ω g t 0  t  te dU g t    Et     E e ε  t  t  ε  t  t  e dt e c e t e  t  t c  To  εt e  a     Eo Emax amplitude We present the analysis performed over eleven speakers (five women and 6 men) in order to obtain the most important parameters as far as speaker identity is concerned. Parameters that have been studied are F0, six formants, five bandwidths and four source parameters. Feature selection is based on linear discriminant analysis. Results show that the most relevant parameter is F0, followed by formant frequencies and open quotient. 0 -Ee 0 ta tp te tc time To Figure 1. LF model Temporal parameters (tp,te,tc,ta,T0) have been transformed into four parameters: 1. Open quotient. It relates the time glottis remains opened with the duration of the glottal period. tc 100 T0 Velocity coefficient. It is the time glottis is opening over the time glottis is closing. tp VELO  tc  t p KOPEN  2. 3. Glottal closure skewness. It is the relation between the maximum glottal flow and the maximum excitation instant in its derivative. Low values reflect abrupt glottal closure and high values reflect slower glottal closure. te  t p SKEW  tp 0 tini tp te tc Ee Figure 4. Extraction of LF model parameters. 4. Return coefficient. It reflects the speed of the glottal flow from the point of maximum excitation to the glottal closure instant. 1 RET  2t a To extract glottal source parameters we use the system described in figure 2 . LPC Analysis r(t) residue Integrator i(t) residue integral Calculate source parameters Polinomic aproximation ia(t), aproximate integral tp, te, tc, ta, Ee, Eo Figure 5. Integral, polinomic approximation and LF approximation. Figure 2. System to extract glottal source parameters. The integral of the residue of the LPC analysis is approximated by a six order polinomial period by period [4],[5]. The polinomic coefficients are calculated with a curve fitting algorithm. Results are shown in figure 3 . 2.3. Vocal tract parameters We calculate six formant frequencies and five bandwidths to describe the vocal tract. To extract these parameters we analyzed the energy distribution of the spectral envelope. We calculate the roots of the LP coefficients and select the roots that contain more energy. We automatically correct formant trajectories to avoid mistakes during voiced parts of speech. Finally, formant frequencies and bandwidths are manually revised to fix errors. 3. Linear Discriminant Analysis Figure 3. Residue integral and its polinomic approximation It is interesting to use the approximation instead of the integral itself when the stochastic component is high or the integral does not adjust to the model. The polinomic approximation reflects low frequency components and it is more suitable to calculate LF model parameters. To extract LF parameters we search for significant points in the integral approximation as shown in figure 4. Discriminant functions are linear combinations of variables that best separate groups. These functions can be used to rank the variables in terms of their relative contribution to group separation [7]. The aim of linear discriminant analysis is to obtain N functions (where N is the number of variables) that transform the original y variables in the new z variables that best separate the K groups. zi  ai ' y  ai1 y1  ai 2 y2  ...  aiN y N i  1,2,..., N To offset differing scales among the variables, the discriminant function coefficients can be standardized using: y  yN y y y  y2 zi  ai1* 1 1  ai 2* 2  ...  aiN * N s1 s2 sN Where y is the variable mean and s is the standard deviation of the variable. The standardized variables are scale free. The standardized coefficients indicate the influence of each variable in the presence of the others. The absolute values of the coefficients ar* can be used to rank the variables in order of their contribution to separating groups. To calculate the coefficients we use Fisher linear discriminant analysis and then we rank the coefficients of the first discriminant function. 4. Results The sentence “Mi abuelo me animó a estudiar solfeo” uttered by eleven speakers has been analyzed. Analysis results are shown in the following figures. 0 500 1000 1500 2000 2500 Figure 8. Return coefficient of the eleven speakers. 20 40 60 80 100 Figure 6. Open quotient of the eleven speakers. 0 1 2 3 4 Figure 9. Velocity coefficient of the eleven speakers. 0 0.1 0.2 0.3 0.4 Figure 7. SKEW of the eleven speakers 0.5 0.6 In the previous figures the right and left lines of the ‘boxes’ are the 25th and 75th percentiles of the sample. The distance between them is the interquartile range. The line in the middle of the box is the sample median . Linear discriminant analysis is applied to all the combinations of speakers taken from 2 by 2 to all together. When using the parameters for the whole utterance linear discriminant does not work. It only separates males of females taking into account F0 differences. We performed the same analysis phone by phone, considering only the voiced sounds. All the speakers for all combinations were successfully classified. We consider the first discriminant function to rank the variables and count how many times the variable is the most relevant and how many times it is the least relevant. Results are shown in figure 10. more relevant more relevant 15000 200 10000 100 5000 0 F0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5 F6OQRESKVE 0 less relevant F0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5 F6OQRESKVE less relevant 10000 100 5000 0 50 F0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5 F6OQRESKVE Figure 10. Results of linear discriminant analysis. We can see that the most relevant parameter is F0 followed by formants and OQ. The least significant parameter is the velocity coefficient. If we use only the vowels the results are quite similar as we can see in figure 11. more relevant 10000 5000 0 F0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5 F6OQRESKVE Figure 13. Results for male speakers. 5. Conclusions We have tried to show that the parameters we have considered in a speaker conversion task are relevant for speaker identity. Our results show that the order of importance depends on the type of speech we are considering, the gender of the speaker and the type of phones. Results show that F0 , formant frequencies and OQ are the most important parameters for speaker classification. 6. References 0 F0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5 F6OQRESKVE less relevant 8000 6000 4000 2000 0 F0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5 F6OQRESKVE Figure 11. Results using only vowels. When considering only female speakers results change significantly. We can see in figure 12 that the most relevant parameter is F5 followed by B2, F2 and OQ. Results obtained for male speakers are shown in figure 13. more relevant 100 50 0 F0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5 F6OQRESKVE less relevant 100 50 0 F0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5 F6OQRESKVE Figure 12. Results for female speakers. [1] Schwartz,M.F., Rine,H.E. “Identification of speaker sex from isolated, whispered vowels”. J. Acoust. Soc. Amer., vol. 44, pp 1736-1737. 1968. [2] Childers,D.G., Lee,C.K. “Vocal quality factors; Analysis, synthesis and perception”. J. Acoust. Soc. Amer., vol.90 (5), pp 2394-2410. 1991. [3] Bachorowski,J.A., Owren,M.J. “Acoustic correlates of talker sex and individual talker identity are present in a short vowel segment produced in running speech”. J. Acoust. Soc. Amer., vol 106 (2), pp 1054-1063. 1999. [4] Childers, D.G., Hu, H.T. “Speech Synthesis by glottal excited linear prediction”. J. Acoust. Soc. Amer., vol.96, pp 2026-2036. 1994 [5] Gutiérrez-Arriola, J.M., Giménez de los Galanes, F., Savoji, M.H., Pardo, J.M. “Speech synthesis and prosody modification using segmentation and modelling of the excitation signal”. Proc. Eurospeech'97, vol. 2, pp 10591062. Rodas, 1997 [6] Gutiérrez-Arriola, J.M., Hsiao, Y.S., Montero, J.M., Pardo, J.M., Childers, D.G. “Voice Conversion Based on Parameter Transformation”. Proc. ICSLP'98, vol. 3, pp 987-990. Sidney, 1998 [7] Rencher, A.C. “Methods of multivariate analysis”. Ed. John Wiley & Sons. 1995

AI-82Ana-03 - Grupo de Tecnología del Habla

Related documents

Products

Support

AI-82Ana-03 - Grupo de Tecnología del Habla

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib