Spectral Analysis of Male and Female Speech Signals

Omar Zelmati, Boban Bondžulić, Milenko Andrić, Dimitrije Bujaković

Omar Zelmati is with the Military Academy, University of Defence in Belgrade, 33 Generala Pavla Jurišića Šturma, 11000 Belgrade, Serbia (e-mail: omarzelmati1991@gmail.com).
Boban Bondžulić is with the Military Academy, University of Defence in Belgrade, 33 Generala Pavla Jurišića Šturma, 11000 Belgrade, Serbia (e-mail: bondzulici@yahoo.com).
Milenko Andrić is with the Military Academy, University of Defence in Belgrade, 33 Generala Pavla Jurišića Šturma, 11000 Belgrade, Serbia (e-mail: andricsmilenko@gmail.com).
Dimitrije Bujaković is with the Military Academy, University of Defence in Belgrade, 33 Generala Pavla Jurišića Šturma, 11000 Belgrade, Serbia (e-mail: dimitrije.bujakovic@va.mod.gov.rs).

Abstract—In this paper, a spectral analysis is applied to a database of male and female speech recordings. The effect of unsounded (silent) parts of the audio signal on the spectrum shape is studied, and it is shown that silent parts strongly disturb the spectral analysis and should be removed. Different spectra are then compared by means of correlation. Spectra originating from the same speaker are shown to be strongly correlated, and a significant correlation between spectra of speakers of the same gender is also highlighted. Finally, the effect of audio signal duration on spectral correlation is discussed. The obtained results are promising and can be used in several fields of speech signal analysis, such as speaker recognition and speaker gender identification.

Index Terms—Spectral analysis, speech signal, correlation.

I. INTRODUCTION

Analysis of speech signals has various applications, such as speaker identification, automatic speech recognition, speaker gender recognition, and speech enhancement. In recent years, much research has been carried out to exploit the advantages of spectral analysis over time-domain analysis of speech. As stated in [1], spectral analysis (also known as the "Fourier representation") is often used to highlight properties of the speech signal that may be hidden or less obvious when the signal is represented in the time domain. These properties are extracted from a spectrum using different techniques and can then be exploited in various methods, depending on the application and the purpose of the research.

Various approaches to and implementations of spectral feature extraction have been studied in the literature [2], such as the Discrete Fourier Transform (DFT) and its fast implementations: the Fast Fourier Transform (FFT) and the Short-Time Fourier Transform (STFT). Regardless of the signal type, the Fourier representation aims to decompose the signal into its frequency components [3]. For the special case of speaker recognition, numerous signal decomposition techniques based on the DFT have been proposed. Alternatives such as non-harmonic bases, aperiodic functions, and data-driven bases derived from independent component analysis have also been discussed, and their effectiveness has been demonstrated [4-6]. However, because of its simplicity and efficiency, the DFT is used in practice, and usually only the magnitude spectrum is considered, based on the belief that phase has little perceptual importance [7].
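Magnitude-only analysis of this kind is straightforward to reproduce; the sketch below is an illustration (the paper reports no implementation details) of computing a one-sided, peak-normalized magnitude spectrum with NumPy.

```python
# Illustrative sketch: one-sided, peak-normalized magnitude spectrum.
# The DFT phase is discarded, as discussed above.
import numpy as np

def magnitude_spectrum(signal, fs):
    """Return the frequency axis (Hz) and the normalized magnitude spectrum."""
    mag = np.abs(np.fft.rfft(signal))                 # keep magnitude, drop phase
    mag /= mag.max()                                  # normalize to a peak of 1
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)  # frequency of each bin
    return freqs, mag
```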
The overall shape of the DFT magnitude spectrum, called the spectral envelope, contains information on the resonance properties of the vocal tract and has been found to be the most informative part of the spectrum for speaker recognition [1]. Dynamic spectral features (spectral transitions), as well as instantaneous spectral features, play an important role in human speech perception [8], and much research on feature extraction for isolated word recognition has investigated the effectiveness of spectral features. In [9] it is shown that Mel Frequency Cepstral Coefficients (MFCCs) are robust features for isolated word recognition; there, the coefficients are calculated in six steps: pre-emphasis, Hamming windowing, frame blocking, FFT, log Mel spectrum calculation, and the Discrete Cosine Transform. Furthermore, MFCCs and other measures such as pitch, log energy, and Mel-band energy have been used as base features for emotion recognition from speech in [10]. The Linear Prediction Coding (LPC) method is applied to speaker gender recognition in [11], where it acts as a filter that spectrally flattens the input signal. Beyond gender recognition, LPC has also been used in speech enhancement, particularly for noise reduction [12].

In this research, a spectral analysis of a recently created database of audio signals is performed. The analysis is carried out with regard to speaker gender (male or female), and the correlation between the spectra of the recorded audio signals is used as the measure of spectral similarity between speakers. These correlations are analyzed for different speakers and for different texts. In addition, the effect of audio signal duration on the correlation between male and female speakers is analyzed.

The rest of the paper is organized as follows. Section II describes the audio signal database. Section III applies an FFT-based spectral analysis to the database recordings through three analyses: speaker-based correlation, text-based correlation, and the effect of speech signal duration on the correlation between speakers. In the last part of the paper, the results are discussed and conclusions are drawn, together with directions for further research.

II. DESCRIPTION OF SPEECH SIGNALS DATABASE

For the purposes of the male and female speech signal analysis, an audio database was created by recording five prepared texts in the Serbian language, read by five male and five female speakers. As each speaker read all five texts, the complete database consists of 50 audio records, each about 30 seconds long. All speakers are aged between 20 and 25 years (students of the Military Academy in Belgrade). To guarantee identical environmental conditions, all recordings were made in the same sound-isolated room. The recordings were made using the SpectraLAB software package on a DELL laptop, with a sampling rate of fs = 8 kHz and 16-bit resolution. The microphone of a Genius HS04S VoIP headset was used as the input; its sensitivity is -60 dB and its frequency response spans 50 Hz to 20 kHz.
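When working with such recordings, the stated format can be verified at load time. The sketch below is hypothetical (the file name is illustrative, and the database is not distributed with the paper); it assumes the recordings are stored as standard WAV files.

```python
# Hypothetical sketch: load one database recording and check the format
# stated above (8 kHz sampling rate, 16-bit samples, roughly 30 s long).
from scipy.io import wavfile

fs, signal = wavfile.read("speaker01_text1.wav")   # illustrative file name
assert fs == 8000, "database recordings were sampled at 8 kHz"
assert signal.dtype == "int16", "recordings use 16-bit resolution"
print(f"duration: {len(signal) / fs:.1f} s")
```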
The audio recordings contain words used in military terminology (gun, pistol, airplane, attack, defense, etc.), so the database is a good basis for further research on isolated word recognition or speaker identification.

III. SPEECH SIGNALS SPECTRAL ANALYSIS

In this research, the FFT is applied to the recorded audio set in order to calculate the correlation between the obtained spectra. Although the recording was done under controlled conditions, a segmentation step is necessary to eliminate the effect of background noise on the spectrum shape. All analyzed speech signals are therefore preprocessed to remove their silent parts. The silence removal algorithm, proposed in [13], is briefly described here. It is based on two audio features, the signal energy and the spectral centroid, and consists of four stages:
1) extraction of the signal energy and the spectral centroid from the decomposed sequences of the audio signal;
2) threshold estimation, where two thresholds are calculated on the basis of the extracted features;
3) application of the threshold criterion to the audio signal sequences; and
4) speech segment detection based on the threshold criterion, followed by post-processing.

Firstly, the audio signal is divided into non-overlapping frames of the same duration (in this research, 50 ms), where s_i(n), n = 0, ..., N-1, are the audio samples of the i-th frame of length N. The energy of one frame is calculated as

E_i = \frac{1}{N} \sum_{n=0}^{N-1} s_i^2(n).    (1)

This energy is used to detect the silent frames in the audio signal, based on the assumption that, if the level of background noise is not very high, the energy of the voiced segments is significantly greater than the energy of the silent segments [13]. Silent segments may still contain environmental sounds, and for that reason the spectral centroid is also measured: such noisy sounds tend to have lower frequencies, so the spectral centroid of low-energy segments is much smaller [13]. If X_i(k), k = 0, ..., N-1, are the DFT coefficients of the i-th frame of length N, the spectral centroid is calculated as

C_i = \frac{\sum_{k=0}^{N-1} (k+1) |X_i(k)|}{\sum_{k=0}^{N-1} |X_i(k)|}.    (2)

After determining all voiced segments of the audio signal, a new voiced signal is created by concatenating all voiced segments of the original audio signal, and the FFT is applied to each modified signal. The FFT magnitudes of an audio signal from the database before and after silence removal are shown in Fig. 1.

Fig. 1. Normalized spectrum of the audio signal: a) before silence removal, b) after silence removal.

Comparing the spectra before silence removal (Fig. 1a) and after silence removal (Fig. 1b), it can be noticed that silent parts strongly affect the spectrum shape. After silence removal, the magnitudes of frequencies above 1800 Hz are suppressed. The correlation between the spectra of the audio signal before and after silence removal is about 20%.
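The sketch below illustrates a simplified variant of this preprocessing, implementing Eqs. (1) and (2) frame by frame. Note that [13] estimates its two thresholds from histograms of the feature sequences; the fixed percentile used here is an assumption of this sketch, not the method of the paper or of [13].

```python
# Simplified silence-removal sketch in the spirit of [13]: keep only frames
# whose energy (Eq. 1) and spectral centroid (Eq. 2) exceed thresholds,
# then concatenate the kept frames. Percentile thresholds are an assumption.
import numpy as np

def remove_silence(signal, fs=8000, frame_ms=50, pct=40):
    n = int(fs * frame_ms / 1000)                        # samples per frame
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]

    energy, centroid = [], []
    for f in frames:
        f = f.astype(float)
        energy.append(np.mean(f ** 2))                   # Eq. (1)
        mag = np.abs(np.fft.fft(f))
        k = np.arange(1, n + 1)                          # (k + 1), k = 0..N-1
        centroid.append(np.sum(k * mag) / np.sum(mag))   # Eq. (2)
    energy, centroid = np.array(energy), np.array(centroid)

    keep = (energy > np.percentile(energy, pct)) & \
           (centroid > np.percentile(centroid, pct))     # threshold criterion
    return np.concatenate([f for f, v in zip(frames, keep) if v])
```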
A. Speaker-based Correlation

In the first analysis, the spectra (after silence removal) of audio signals in which the same speaker reads different texts are compared. The normalized spectra of the audio signals of one male speaker (speaker 3) and one female speaker (speaker 7) from the database are shown in Fig. 2.

Fig. 2. Spectra of the audio speech signals of Texts 1-5 for the same: a) male speaker (speaker 3) and b) female speaker (speaker 7).

From Fig. 2, it can be noticed that signals originating from the same speaker reading different texts have a similar spectral shape. For the male speaker (Fig. 2a), the spectral components are wider, while for the female speaker (Fig. 2b) they are concentrated around 200 Hz and 400 Hz. In order to quantify this spectral similarity, the correlation between the spectra of the audio signals produced by the male and by the female speaker is calculated over the spoken texts, with and without silence removal. The results are shown in Table I (male speaker) and Table II (female speaker).

TABLE I
SPECTRUM-BASED CORRELATION FOR THE MALE SPEAKER

Without silence removal:
         Text 1  Text 2  Text 3  Text 4  Text 5
Text 1   1       0.48    0.46    0.41    0.45
Text 2   0.48    1       0.47    0.40    0.43
Text 3   0.46    0.47    1       0.39    0.43
Text 4   0.41    0.40    0.39    1       0.40
Text 5   0.45    0.43    0.43    0.40    1

With silence removal:
         Text 1  Text 2  Text 3  Text 4  Text 5
Text 1   1       0.69    0.68    0.66    0.64
Text 2   0.69    1       0.71    0.68    0.67
Text 3   0.68    0.71    1       0.70    0.68
Text 4   0.66    0.68    0.70    1       0.70
Text 5   0.64    0.67    0.68    0.70    1

TABLE II
SPECTRUM-BASED CORRELATION FOR THE FEMALE SPEAKER

Without silence removal:
         Text 1  Text 2  Text 3  Text 4  Text 5
Text 1   1       0.37    0.36    0.36    0.32
Text 2   0.37    1       0.37    0.38    0.32
Text 3   0.36    0.37    1       0.37    0.31
Text 4   0.36    0.38    0.37    1       0.32
Text 5   0.32    0.32    0.31    0.32    1

With silence removal:
         Text 1  Text 2  Text 3  Text 4  Text 5
Text 1   1       0.70    0.68    0.68    0.65
Text 2   0.70    1       0.70    0.69    0.70
Text 3   0.68    0.70    1       0.71    0.71
Text 4   0.68    0.69    0.71    1       0.72
Text 5   0.65    0.70    0.71    0.72    1

Comparing the results in Tables I and II, a high correlation can be noticed between the spectra of the same speaker for different texts, both for the male and for the female speaker, after silence removal, while the correlation is much lower without silence removal. The highest correlation value is obtained for the female speaker (72%) and the lowest for the male speaker (64%). However, the average spectral correlation is nearly the same in both cases: 68% for the male and 69% for the female speaker. From this, it can be concluded that the spectral shape after silence removal may be used for feature extraction in speaker recognition.
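The paper does not spell out how the correlation between two spectra is computed; one plausible reading, sketched below, is the Pearson correlation coefficient of the normalized magnitude spectra, assuming the signals are truncated to a common length so that the frequency bins align.

```python
# Hypothetical sketch: spectral correlation between two speech signals as
# the Pearson correlation of their normalized magnitude spectra.
import numpy as np

def spectral_correlation(x, y):
    n = min(len(x), len(y))                  # simplest bin-alignment choice
    sx = np.abs(np.fft.rfft(np.asarray(x[:n], dtype=float)))
    sy = np.abs(np.fft.rfft(np.asarray(y[:n], dtype=float)))
    sx, sy = sx / sx.max(), sy / sy.max()    # normalize as in Fig. 1
    return np.corrcoef(sx, sy)[0, 1]
```

Truncating to min(len(x), len(y)) is the simplest alignment; interpolating both spectra onto a common frequency grid would be a reasonable alternative.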
B. Text-based Correlation

In order to determine the correlation between the spectra of the same text uttered by different speakers, the correlations between the spectra of all ten speakers are calculated for Text 1, with and without silence removal. The results are shown in Table III. The male-male block of the table (speakers 1-5, colored blue in the original layout) contains the correlations between the spectra of the five male speakers for the read text, the female-female block (speakers 6-10, colored orange) contains the correlations between the spectra of the five female speakers for the same text, and the remaining male-female block (colored yellow) contains the spectral correlations between male and female speakers.

TABLE III
SPECTRAL CORRELATION OF TEXT 1 (SPEAKERS 1-5 MALE, 6-10 FEMALE)

Without silence removal:
      1     2     3     4     5     6     7     8     9     10
1     1     0.21  0.36  0.33  0.37  0.23  0.22  0.17  0.21  0.10
2     0.21  1     0.25  0.23  0.24  0.14  0.10  0.14  0.16  0.08
3     0.36  0.25  1     0.38  0.33  0.28  0.24  0.29  0.33  0.14
4     0.33  0.23  0.38  1     0.35  0.22  0.20  0.24  0.26  0.15
5     0.37  0.24  0.33  0.35  1     0.20  0.18  0.23  0.22  0.19
6     0.23  0.14  0.28  0.22  0.20  1     0.31  0.28  0.36  0.14
7     0.22  0.10  0.24  0.20  0.18  0.31  1     0.27  0.30  0.15
8     0.17  0.14  0.29  0.24  0.23  0.28  0.27  1     0.37  0.29
9     0.21  0.16  0.33  0.26  0.22  0.36  0.30  0.37  1     0.21
10    0.10  0.08  0.14  0.15  0.19  0.14  0.15  0.29  0.21  1

With silence removal:
      1     2     3     4     5     6     7     8     9     10
1     1     0.53  0.54  0.59  0.55  0.44  0.43  0.32  0.38  0.31
2     0.53  1     0.56  0.61  0.55  0.43  0.35  0.40  0.45  0.36
3     0.54  0.56  1     0.60  0.50  0.49  0.44  0.43  0.51  0.37
4     0.59  0.61  0.60  1     0.57  0.50  0.46  0.46  0.51  0.43
5     0.55  0.55  0.50  0.57  1     0.39  0.34  0.37  0.39  0.42
6     0.44  0.43  0.49  0.50  0.39  1     0.63  0.52  0.66  0.44
7     0.43  0.35  0.44  0.46  0.34  0.63  1     0.50  0.58  0.42
8     0.32  0.40  0.43  0.46  0.37  0.52  0.50  1     0.58  0.59
9     0.38  0.45  0.51  0.51  0.39  0.66  0.58  0.58  1     0.51
10    0.31  0.36  0.37  0.43  0.42  0.44  0.42  0.59  0.51  1

From Table III, it can be noticed that the average spectral correlation between different male speakers after silence removal is about 56%, while between different female speakers it is about 54%. The average correlation in the male-female block is about 42%. From this, it can be concluded that spectra of speakers of the same gender are notably correlated, while spectra of speakers of different genders are less correlated. This can be explained by the fact that the energy of audio originating from a female speaker is concentrated around 200 Hz and 400 Hz, while the energy of a male speaker is distributed over a larger frequency range. Without silence removal, the average spectral correlation between the male speakers is about 30% and between the female speakers about 27%. These results confirm that the correlation decreases significantly because of the low-energy (unsounded) parts of the signal. These conclusions may be used for feature extraction for gender recognition from speech.
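The block averages quoted above (about 56%, 54%, and 42%) amount to averaging the gender blocks of the 10x10 correlation matrix; a minimal sketch, assuming speakers 1-5 are male and 6-10 female as in Table III:

```python
# Sketch: average the gender blocks of a 10x10 correlation matrix R,
# with speakers 1-5 male and 6-10 female (as in Table III).
import numpy as np

def block_averages(R):
    iu = np.triu_indices(5, k=1)      # the 10 off-diagonal pairs per gender
    male = R[:5, :5][iu].mean()       # male vs. male
    female = R[5:, 5:][iu].mean()     # female vs. female
    mixed = R[:5, 5:].mean()          # male vs. female (all 25 pairs)
    return male, female, mixed
```

Applied to the with-silence-removal matrix of Table III, this yields approximately 0.56, 0.54, and 0.42, matching the figures above.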
C. Effect of Audio Signal Duration

To investigate the effect of audio signal duration on spectral correlation, ten signals were prepared by concatenating all the uttered texts of each speaker, so that each speaker is represented by one longer signal. The spectrum of each concatenated signal after silence removal is determined, and its correlation with the spectra of the original signals is calculated. The results are shown in Table IV.

TABLE IV
CORRELATION BETWEEN THE SPECTRA OF THE CONCATENATED SIGNALS AND THE ORIGINAL ONES

Speaker  Text 1  Text 2  Text 3  Text 4  Text 5
1        0.77    0.75    0.81    0.72    0.70
2        0.76    0.80    0.78    0.81    0.77
3        0.81    0.73    0.75    0.76    0.76
4        0.85    0.84    0.81    0.77    0.72
5        0.73    0.79    0.67    0.81    0.80
6        0.82    0.69    0.68    0.66    0.64
7        0.79    0.78    0.71    0.80    0.68
8        0.78    0.71    0.76    0.70    0.78
9        0.81    0.68    0.75    0.79    0.70
10       0.74    0.82    0.77    0.70    0.83

From Table IV, it can be noticed that the correlation is enhanced when a longer speech signal is used. The average correlation between the different texts of speaker 3 in Table I (with silence removal) is around 68%, while the average correlation for the same speaker calculated from Table IV is about 76%. From this, it can be concluded that the longer the audio signal, the higher the spectral correlation after silence removal.

To support this conclusion, correlation matrices for each of the five analyzed texts and for all concatenated texts are calculated. From each obtained matrix, the average correlation between the spectra of male speakers (male vs. male), between the spectra of female speakers (female vs. female), and between speakers of different gender (male vs. female) is calculated. Referring to Table III, the male vs. male average is computed from the male-male block of the table, the female vs. female average from the female-female block, and the male vs. female average from the male-female block. The results of this analysis are presented in Fig. 3.

Fig. 3. Average correlation for the different texts and speaker gender: a) without silence removal and b) with silence removal.

From Fig. 3, it can be noticed that for longer analyzed signals the spectral correlation between speakers is higher with silence removal than without it. Moreover, the spectral correlation after silence removal is higher between speakers of the same gender and lower between speakers of different genders. These results can be used to determine the optimal speech recording duration when preparing training data for speech analysis.

IV. CONCLUSION

This paper presented a spectral analysis of a recently created database of male and female speech samples. Using the spectral correlation measure, the effect of unsounded (silent) parts of the audio signal on the spectrum shape was analyzed, and it was shown that silent parts strongly affect the results of a spectral analysis. In addition, the spectra obtained from signals of the same speaker uttering different texts were compared and shown to be strongly correlated, and a significant correlation was found between the spectra of speakers of the same gender reading the same text. The effect of audio signal duration on spectral correlation was also analyzed; the obtained results show that longer signals yield higher spectral correlation. These results may serve as a basis for future research on speech signal analysis. In future work, the problem of spectral-based feature extraction for speaker identification will be considered, and the cases of background noise, music, and distant speakers will be taken into account.
REFERENCES

[1] L. R. Rabiner and R. W. Schafer, Theory and Applications of Digital Speech Processing. NJ, USA: Pearson, 2011.
[2] A. V. Oppenheim, Discrete-Time Signal Processing. India: Pearson Education India, 1999.
[3] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. New Jersey, USA: Prentice-Hall, 1978.
[4] K. Gopalan, T. R. Anderson, and E. J. Cupples, "A comparison of speaker identification results using features based on cepstrum and Fourier-Bessel expansion," IEEE Trans. on Speech and Audio Processing, vol. 7, no. 3, pp. 289-294, Apr. 1999.
[5] B. Imperl, Z. Kačič, and B. Horvat, "A study of harmonic features for the speaker recognition," Speech Communication, vol. 22, no. 4, pp. 385-402, Feb. 1997.
[6] G.-J. Jang, T.-W. Lee, and Y.-H. Oh, "Learning statistically efficient features for speaker recognition," Neurocomputing, vol. 49, no. 4, pp. 329-348, Jun. 2002.
[7] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12-40, Jan. 2010.
[8] G. Ruske and T. Schotola, "The efficiency of demisyllable segmentation in the recognition of spoken words," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP '81), vol. 1, pp. 971-974, Apr. 1981.
[9] M. P. Kesarkar, "Feature extraction for speech recognition," M.Tech. Credit Seminar Report, Bombay, India, 2003.
[10] O.-W. Kwon, K. Chan, J. Hao, and T.-W. Lee, "Emotion recognition by speech signals," in Proc. European Conference on Speech Communication and Technology (Eurospeech/Interspeech), Geneva, Switzerland, Sep. 2003.
[11] K. Rakesh, S. Dutta, and K. Shama, "Gender recognition using speech processing techniques in LabVIEW," International Journal of Advances in Engineering & Technology, vol. 1, no. 2, p. 51, Jul. 2011.
[12] M. Hydari, M. R. Karami, and E. Nadernejad, "Speech signals enhancement using LPC analysis based on inverse Fourier methods," Contemporary Engineering Sciences, vol. 2, no. 1, pp. 1-15, Jan. 2009.
[13] T. Giannakopoulos, "Study and application of acoustic information for the detection of harmful content, and fusion with visual information," Ph.D. dissertation, Dept. of Informatics and Telecommunications, National and Kapodistrian University of Athens, Athens, Greece, 2009.