Spectral Analysis of Male and Female
Speech Signals
Omar Zelmati, Boban Bondžulić, Milenko Andrić, Dimitrije Bujaković
Abstract—In this paper, spectral analysis is applied to a database of male and female speech recordings. The effect of unvoiced (silent) parts of the audio signal on the spectrum shape is studied, and it is shown that silent parts strongly disturb the spectral analysis, so these parts should be removed. Different spectra are then compared by means of correlation. It is shown that spectra originating from the same speaker are strongly correlated, and a significant correlation between spectra of speakers of the same gender is also highlighted. Finally, the effect of the audio signal duration on the spectral correlation is discussed. The obtained results are promising and can be used in several fields of speech signal analysis, such as speaker recognition and speaker gender identification.
Index Terms— Spectral analysis, Speech signal, Correlation.
I. INTRODUCTION
ANALYSIS of speech signals has various applications, such as speaker identification, automatic speech recognition, speaker gender recognition, speech enhancement, etc. In recent years, many studies have been carried out in order to exploit the benefits of spectral analysis over time-domain analysis of speech.
As stated in [1], spectral analysis (also known as the "Fourier representation") is often used to highlight properties of the speech signal that may be hidden or less obvious when the signal is represented in the time domain. These properties are extracted from a spectrum using different techniques and can be further exploited in various methods, depending on the application and the purpose of the research. Various approaches to spectral feature extraction have been studied in the literature [2], such as the Discrete Fourier Transform (DFT), its fast implementation, the Fast Fourier Transform (FFT), and the Short-Time Fourier Transform (STFT).
Regardless of the signal type, the Fourier representation aims to decompose it into its frequency components [3]. For the special case of speaker recognition, numerous signal decomposition techniques based on the DFT have been proposed. Moreover, some alternatives, such as non-harmonic bases, aperiodic functions and data-driven bases derived from independent component analysis, have been discussed and their effectiveness has been demonstrated [4]-[6].
Omar Zelmati is with the Military Academy, University of Defence in Belgrade, 33 Generala Pavla Jurišića Šturma, 11000 Belgrade, Serbia (e-mail: omarzelmati1991@gmail.com).
Boban Bondžulić is with the Military Academy, University of Defence in Belgrade, 33 Generala Pavla Jurišića Šturma, 11000 Belgrade, Serbia (e-mail: bondzulici@yahoo.com).
Milenko Andrić is with the Military Academy, University of Defence in Belgrade, 33 Generala Pavla Jurišića Šturma, 11000 Belgrade, Serbia (e-mail: andricsmilenko@gmail.com).
Dimitrije Bujaković is with the Military Academy, University of Defence in Belgrade, 33 Generala Pavla Jurišića Šturma, 11000 Belgrade, Serbia (e-mail: dimitrije.bujakovic@va.mod.gov.rs).
However, because of its simplicity and efficiency, the DFT is the transform used in practice, and usually only the magnitude spectrum is considered, based on the belief that phase has little perceptual importance [7]. The overall shape of the DFT magnitude spectrum, called the spectral envelope, contains information on the resonance properties of the vocal tract and has been found to be the most informative part of the spectrum for the purpose of speaker recognition [1].
Dynamic spectral features (spectral transitions) as well as instantaneous spectral features play an important role in human speech perception [8], and many studies on feature extraction for isolated word recognition have investigated the effectiveness of spectral features. In [9] it is shown that Mel Frequency Cepstral Coefficients (MFCCs) are robust features for isolated word recognition. In that research, the coefficients are calculated in six steps: pre-emphasis, frame blocking, Hamming windowing, FFT, log Mel spectrum calculation and the Discrete Cosine Transform. Furthermore, MFCCs and other measures, such as pitch, log energy and Mel-band energy, have been adopted as base features for emotion recognition from speech signals in [10].
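As an illustration of this pipeline, the following minimal sketch implements the six steps using only NumPy and SciPy. The frame length, FFT size, filter count and coefficient count are illustrative choices, not values taken from [9].

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, fs=8000, frame_len=400, n_filters=20, n_coeffs=13):
    """Minimal MFCC pipeline: pre-emphasis, frame blocking,
    Hamming windowing, FFT, log Mel spectrum, DCT."""
    # 1) Pre-emphasis: boost high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2) Frame blocking: non-overlapping frames (50 ms at 8 kHz = 400 samples).
    n_frames = len(emphasized) // frame_len
    frames = emphasized[:n_frames * frame_len].reshape(n_frames, frame_len)
    # 3) Hamming windowing.
    frames = frames * np.hamming(frame_len)
    # 4) FFT power spectrum.
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 5) Log Mel spectrum: triangular filter bank spaced on the Mel scale.
    high_mel = 2595.0 * np.log10(1.0 + (fs / 2) / 700.0)
    mel_points = np.linspace(0.0, high_mel, n_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(np.maximum(power @ fbank.T, 1e-10))
    # 6) DCT decorrelates the log Mel energies; keep the first coefficients.
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_coeffs]
```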
The Linear Prediction Coding (LPC) method is applied to speaker gender recognition in [11]. LPC models the speech spectrum with an all-pole filter and can be used to spectrally flatten the input signal. Beyond its use in gender recognition, this method has been applied in speech enhancement, particularly for noise reduction [12].
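As a rough illustration of the idea (not the exact procedure of [11]), LPC coefficients can be obtained from the autocorrelation sequence with the Levinson-Durbin recursion; the model order below is an arbitrary example.

```python
import numpy as np

def lpc(signal, order=10):
    """LPC coefficients via the autocorrelation method and the
    Levinson-Durbin recursion; returns a with a[0] = 1."""
    # Autocorrelation for lags 0..order.
    r = np.correlate(signal, signal, mode='full')[len(signal) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for the current order.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a
```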
In this research, a spectral analysis of a recently formed database of audio signals is performed. The analyses are carried out with regard to the speaker gender (male or female), and the correlation between the spectra of the recorded audio signals is used as the measure of spectral similarity between different speakers. These correlations are analyzed for different speakers and for different texts. Besides this, the effect of the audio signal duration on the correlation between male and female speakers is also analyzed.
The rest of the paper is organized as follows. In Section II, the used audio signal database is described. In Section III, FFT-based spectral analysis is applied to the audio recordings of the database through three analyses: speaker-based correlation, text-based correlation and the effect of the speech signal duration on the correlation between various speakers. In the last part of the paper, the results are discussed and some conclusions are highlighted, together with directions for further research.
II. DESCRIPTION OF SPEECH SIGNALS DATABASE
For the purposes of the male and female speech signal analysis, an audio recording database is created by recording five prepared texts in the Serbian language, read by five male and five female speakers. As each speaker read all five texts, the complete database consists of 50 audio records. The duration of each record is about 30 seconds.
All speakers are aged between 20 and 25 years (students of the Military Academy in Belgrade). In order to guarantee the same environmental recording conditions, all recordings were made in the same place, in a sound-isolated room. All voice recordings were made using the SpectraLAB software package on a DELL laptop, with a sampling rate of fs = 8 kHz and 16-bit resolution. As input, the microphone of the Genius HS04S VoIP headset is used. The sensitivity of the used microphone is -60 dB, while its frequency response spans from 50 Hz to 20 kHz.
The audio recordings contain words used in military terminology (gun, pistol, airplane, attack, defense, etc.), so they represent a good basis for further research in isolated word recognition or speaker identification.
III. SPEECH SIGNALS SPECTRAL ANALYSIS
In this research, the FFT is applied to the recorded audio set in order to calculate the correlation between the obtained spectra. Although the recording is done under controlled conditions, a segmentation step is necessary in order to eliminate the effect of background noise on the spectrum shape. All analyzed speech signals are therefore preprocessed to remove their silent parts. The algorithm for silence removal is proposed in [13] and is briefly described here. The method is based on two audio features, the signal energy and the spectral centroid, and consists of four stages:
1) extraction of the signal energy and the spectral centroid from the already decomposed sequences of the audio signal;
2) threshold estimation, where two thresholds are calculated on the basis of the extracted features;
3) application of the threshold criterion to the audio signal sequences; and
4) speech segment detection based on the threshold criterion, followed by post-processing.
Firstly, the audio signal is divided into non-overlapping frames of the same duration (in this research, the frame duration is 50 ms). If s_i(n), n = 0, ..., N-1, are the audio samples of the i-th frame of length N, the energy of the frame is calculated as:
$$E_i = \frac{1}{N}\sum_{n=0}^{N-1} s_i^2(n). \qquad (1)$$
This energy is used to detect the silent frames present in the audio signal, based on the assumption that, if the level of background noise is not very high, the energy of the voiced segments is significantly greater than the energy of the silent segments [13]. Silent segments may still contain environmental sounds, and for that reason the spectral centroid is also measured. The spectral centroid of low-energy segments is much smaller, as these noisy sounds tend to be concentrated at lower frequencies [13]. If X_i(k), k = 0, ..., N-1, are the DFT coefficients of the i-th frame of length N, the spectral centroid is calculated as:

$$C_i = \frac{\sum_{k=0}^{N-1} (k+1)\,X_i(k)}{\sum_{k=0}^{N-1} X_i(k)}. \qquad (2)$$

After determining all voiced segments of the audio signal, a new voiced-only signal is created by concatenating all voiced segments of the original audio signal. The FFT is then applied to each modified signal. The magnitudes of the FFT of an audio signal from the created database, before and after silence removal, are shown in Fig. 1.
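A simplified sketch of the four-stage algorithm is given below. The fixed-fraction thresholds are an assumption made here for brevity; [13] estimates the thresholds from local maxima of the feature histograms, and the post-processing stage is omitted.

```python
import numpy as np

def remove_silence(signal, fs=8000, frame_ms=50, ratio=0.5):
    """Simplified silence removal based on short-term energy and
    spectral centroid, after the four-stage scheme of [13]."""
    n = int(fs * frame_ms / 1000)                  # frame length (50 ms)
    n_frames = len(signal) // n
    frames = signal[:n_frames * n].reshape(n_frames, n).astype(float)

    # Stage 1: per-frame energy (Eq. 1) and spectral centroid (Eq. 2).
    energy = np.mean(frames ** 2, axis=1)
    mag = np.abs(np.fft.fft(frames, axis=1))
    k = np.arange(1, n + 1)                        # (k + 1), k = 0..N-1
    centroid = (mag @ k) / np.maximum(mag.sum(axis=1), 1e-12)

    # Stage 2: simplistic thresholds (a fraction of each feature's mean);
    # [13] instead derives them from the feature histograms.
    t_energy = ratio * energy.mean()
    t_centroid = ratio * centroid.mean()

    # Stage 3: a frame is voiced if both features exceed their thresholds.
    voiced = (energy > t_energy) & (centroid > t_centroid)

    # Stage 4: concatenate the voiced frames into a new signal
    # (post-processing such as gap merging is omitted here).
    return frames[voiced].ravel()
```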
Fig. 1. Normalized spectrum of the audio signal: a) before silence removal,
b) after silence removal.
Comparing the spectra before silence removal (Fig. 1.a) and after silence removal (Fig. 1.b), it can be noticed that the silent parts strongly affect the spectrum shape. After silence removal, the magnitudes of the frequency components above 1800 Hz are suppressed. The correlation between the spectra of the audio signal before and after silence removal is only about 20%.
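The paper does not spell out the correlation measure, so the following sketch assumes the Pearson correlation coefficient between the normalized FFT magnitude spectra, computed on a common FFT grid:

```python
import numpy as np

def normalized_spectrum(signal, n_fft=None):
    """One-sided FFT magnitude spectrum, scaled to a peak of 1
    (as in Fig. 1)."""
    mag = np.abs(np.fft.rfft(signal, n_fft))
    return mag / mag.max()

def spectral_correlation(sig_a, sig_b):
    """Pearson correlation between the magnitude spectra of two
    signals, computed on a common FFT grid."""
    n_fft = max(len(sig_a), len(sig_b))
    spec_a = normalized_spectrum(sig_a, n_fft)
    spec_b = normalized_spectrum(sig_b, n_fft)
    return np.corrcoef(spec_a, spec_b)[0, 1]
```

Under these assumptions, spectral_correlation(signal, remove_silence(signal)) reproduces the kind of before/after comparison quoted above.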
A. Speaker-based Correlation
In the first analysis, the spectra (after silence removal) of audio signals in which one speaker reads different texts are compared. The normalized spectra of the audio signals of one male speaker (speaker 3) and one female speaker (speaker 7), extracted from the used database, are shown in Fig. 2.
Fig. 2. Spectra of different audio speech signals for the same: a) male speaker (speaker 3) and b) female speaker (speaker 7).
From Fig. 2, it can be noticed that signals originating from the same speaker reading different texts have a similar spectral shape. Analyzing the spectral properties of the male speaker (Fig. 2.a), it can be noticed that the spectral components are spread over a wider band, while for the female speaker (Fig. 2.b) the spectral components are concentrated around 200 Hz and 400 Hz. In order to quantify this spectral similarity, the correlation of the spectra of the audio signals produced by the male and the female speaker is calculated over the spoken texts, with and without silence removal. The results of this analysis are shown in Table I (male speaker) and Table II (female speaker), and can be computed as sketched below.
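Under the same assumptions as above, a table such as Table I or II can be obtained by stacking the five normalized spectra of one speaker and computing all pairwise correlations at once; signals is an assumed list of five NumPy arrays holding that speaker's recordings:

```python
import numpy as np

def correlation_table(signals):
    """5x5 matrix of pairwise spectral correlations between the
    five texts of one speaker (cf. Tables I and II)."""
    n_fft = max(len(s) for s in signals)
    spectra = np.vstack([normalized_spectrum(s, n_fft) for s in signals])
    return np.corrcoef(spectra)  # rows/cols: Text 1 ... Text 5
```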
TABLE I
SPECTRUM-BASED CORRELATION FOR THE MALE SPEAKER

Without silence removal
         Text 1  Text 2  Text 3  Text 4  Text 5
Text 1     1      0.48    0.46    0.41    0.45
Text 2   0.48      1      0.47    0.40    0.43
Text 3   0.46    0.47      1      0.39    0.43
Text 4   0.41    0.40    0.39      1      0.40
Text 5   0.45    0.43    0.43    0.40      1

With silence removal
         Text 1  Text 2  Text 3  Text 4  Text 5
Text 1     1      0.69    0.68    0.66    0.64
Text 2   0.69      1      0.71    0.68    0.67
Text 3   0.68    0.71      1      0.70    0.68
Text 4   0.66    0.68    0.70      1      0.70
Text 5   0.64    0.67    0.68    0.70      1

TABLE II
SPECTRUM-BASED CORRELATION FOR THE FEMALE SPEAKER

Without silence removal
         Text 1  Text 2  Text 3  Text 4  Text 5
Text 1     1      0.37    0.36    0.36    0.32
Text 2   0.37      1      0.37    0.38    0.32
Text 3   0.36    0.37      1      0.37    0.31
Text 4   0.36    0.38    0.37      1      0.32
Text 5   0.32    0.32    0.31    0.32      1

With silence removal
         Text 1  Text 2  Text 3  Text 4  Text 5
Text 1     1      0.70    0.68    0.68    0.65
Text 2   0.70      1      0.70    0.69    0.70
Text 3   0.68    0.70      1      0.71    0.71
Text 4   0.68    0.69    0.71      1      0.72
Text 5   0.65    0.70    0.71    0.72      1
Comparing the results of the spectral correlation presented in Tables I and II, a high correlation can be noticed between the spectra of the same speaker for different read texts, both for the male and for the female speaker, in the case after silence removal, while it is much lower in the case without silence removal. The highest correlation value is obtained for the female speaker (72%), while the lowest is obtained for the male speaker (64%). However, the average spectral correlation for the male and for the female speaker is nearly the same: 68% for the male and 69% for the female speaker. From this, it can be concluded that the spectral shape after silence removal may be used for feature extraction in order to perform speaker recognition.
B. Text-based Correlation
In order to determine the correlation between the spectra of the same text uttered by different speakers, the correlation between the spectra of different speakers is calculated on text 1, with and without silence removal. The results are shown in Table III, where speakers 1-5 are male and speakers 6-10 are female. The upper-left 5x5 block of each matrix contains the correlations between the spectra of the five male speakers for the read text, the lower-right 5x5 block contains the correlations between the spectra of the five female speakers for the same text, and the remaining off-diagonal block contains the spectral correlations between male and female speakers.
TABLE III
SPECTRAL CORRELATION OF TEXT 1

Without silence removal
        1     2     3     4     5     6     7     8     9    10
 1      1   0.21  0.36  0.33  0.37  0.23  0.22  0.17  0.21  0.10
 2    0.21    1   0.25  0.23  0.24  0.14  0.10  0.14  0.16  0.08
 3    0.36  0.25    1   0.38  0.33  0.28  0.24  0.29  0.33  0.14
 4    0.33  0.23  0.38    1   0.35  0.22  0.20  0.24  0.26  0.15
 5    0.37  0.24  0.33  0.35    1   0.20  0.18  0.23  0.22  0.19
 6    0.23  0.14  0.28  0.22  0.20    1   0.31  0.28  0.36  0.14
 7    0.22  0.10  0.24  0.20  0.18  0.31    1   0.27  0.30  0.15
 8    0.17  0.14  0.29  0.24  0.23  0.28  0.27    1   0.37  0.29
 9    0.21  0.16  0.33  0.26  0.22  0.36  0.30  0.37    1   0.21
10    0.10  0.08  0.14  0.15  0.19  0.14  0.15  0.29  0.21    1

With silence removal
        1     2     3     4     5     6     7     8     9    10
 1      1   0.53  0.54  0.59  0.55  0.44  0.43  0.32  0.38  0.31
 2    0.53    1   0.56  0.61  0.55  0.43  0.35  0.40  0.45  0.36
 3    0.54  0.56    1   0.60  0.50  0.49  0.44  0.43  0.51  0.37
 4    0.59  0.61  0.60    1   0.57  0.50  0.46  0.46  0.51  0.43
 5    0.55  0.55  0.50  0.57    1   0.39  0.34  0.37  0.39  0.42
 6    0.44  0.43  0.49  0.50  0.39    1   0.63  0.52  0.66  0.44
 7    0.43  0.35  0.44  0.46  0.34  0.63    1   0.50  0.58  0.42
 8    0.32  0.40  0.43  0.46  0.37  0.52  0.50    1   0.58  0.59
 9    0.38  0.45  0.51  0.51  0.39  0.66  0.58  0.58    1   0.51
10    0.31  0.36  0.37  0.43  0.42  0.44  0.42  0.59  0.51    1
From Table III, it can be noticed that the average spectral correlation between different male speakers after applying silence removal is about 56%, while between different female speakers it is about 54%. The average correlation of the male-female part is about 42%. From this, it can be concluded that spectra of speakers of the same gender are notably correlated, whilst spectra of speakers of different gender are less correlated. This can be explained by the fact that the energy of the audio originating from the female speakers is concentrated around 200 Hz and 400 Hz, while the energy for the male speakers is distributed over a larger frequency range. The average spectral correlation of all male speakers in the case without silence removal is about 30%, and for all female speakers it is about 27%. These results confirm that the correlation decreases significantly because of the low-energy (unvoiced) parts of the signal. These conclusions may be used for feature extraction in order to perform gender recognition using the speech signal.
C. Effect of Audio Signal Duration
In order to investigate the effect of the audio signal duration on the spectral correlation, ten signals were prepared by concatenating all the uttered texts of each speaker. In this manner, each speaker is represented by a longer signal. The spectrum of this signal after silence removal is determined, and its correlation with the spectra of the original signals is calculated. The results are shown in Table IV; a sketch of this procedure is given below.
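A minimal sketch of this experiment, reusing the remove_silence and spectral_correlation helpers sketched above (texts is an assumed list of NumPy arrays with one speaker's five recordings):

```python
import numpy as np

def duration_experiment(texts):
    """Correlate the spectrum of the concatenation of all of a
    speaker's texts with the spectrum of each individual text."""
    # Concatenate the voiced parts of all five recordings.
    voiced = [remove_silence(t) for t in texts]
    full = np.concatenate(voiced)
    # One correlation value per text, as in one row of Table IV.
    return [spectral_correlation(full, v) for v in voiced]
```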
TABLE IV
CORRELATION BETWEEN THE SPECTRA OF THE CONCATENATED SIGNALS AND THE ORIGINAL ONES

Speaker  Text 1  Text 2  Text 3  Text 4  Text 5
   1      0.77    0.75    0.81    0.72    0.70
   2      0.76    0.80    0.78    0.81    0.77
   3      0.81    0.73    0.75    0.76    0.76
   4      0.85    0.84    0.81    0.77    0.72
   5      0.73    0.79    0.67    0.81    0.80
   6      0.82    0.69    0.68    0.66    0.64
   7      0.79    0.78    0.71    0.80    0.68
   8      0.78    0.71    0.76    0.70    0.78
   9      0.81    0.68    0.75    0.79    0.70
  10      0.74    0.82    0.77    0.70    0.83
From Table IV, it can be noticed that the correlation is enhanced when a longer speech signal is used. Analyzing the results from Table I, it can be noticed that the average correlation between the different texts of speaker 3 (with silence removal) is around 68%. On the other hand, the average correlation of the same speaker calculated from Table IV is about 76%. From this, it can be concluded that, for a longer audio signal, the spectral correlation after silence removal is higher.
In order to support this conclusion, correlation matrices for each of the five analyzed texts and for all concatenated texts are calculated. From each obtained matrix, the average correlation between the spectra of the male speakers (male vs. male), the average correlation between the spectra of the female speakers (female vs. female) and the average correlation between speakers of different gender (male vs. female) are calculated. Referring to Table III, the male vs. male average correlation is computed from the male-male block of the table (speakers 1-5), the female-female block (speakers 6-10) serves to calculate the female vs. female average correlation, and the mixed block is used to calculate the male vs. female correlation, as sketched below. The results of this analysis are presented in Fig. 3.
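The block averaging itself is straightforward; a small sketch for a 10x10 correlation matrix R, with speakers 1-5 male and 6-10 female and the unit diagonal excluded from the same-gender blocks:

```python
import numpy as np

def block_averages(R):
    """Average male-male, female-female and male-female correlations
    from a 10x10 speaker correlation matrix (speakers 1-5 male)."""
    mm = R[:5, :5][~np.eye(5, dtype=bool)].mean()   # male vs. male
    ff = R[5:, 5:][~np.eye(5, dtype=bool)].mean()   # female vs. female
    mf = R[:5, 5:].mean()                           # male vs. female
    return mm, ff, mf
```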
From Fig. 3, it can be noticed that, for longer analyzed signals, the spectral correlation between speakers is higher with silence removal than without it. Besides this, it can be concluded that the spectral correlation after silence removal is higher between speakers of the same gender, while for audio signals originating from speakers of different gender the spectral correlation is lower. These results can be used to determine the optimal speech recording duration when preparing training data for speech analysis.
Fig. 3. Average correlation for the different texts and speaker gender:
a) without silence removal and b) with silence removal.
IV. CONCLUSION
This paper presented a spectral analysis applied to a recently created database consisting of male and female speech samples. Using the spectral correlation measure, the effect of the unvoiced parts of the audio signal on the spectrum shape was analyzed, and it was shown that silent parts strongly affect the results of a spectral analysis. Besides this, the spectra obtained from signals of the same speaker for different uttered texts were compared, and it was shown that they are strongly correlated. Moreover, there is a significant correlation between the spectra obtained from speakers of the same gender reading the same text. The effect of the audio signal duration on the spectral correlation was also analyzed, and the obtained results show that the spectral correlation is higher for longer signals. These results may serve as a basis for future research related to speech signal analysis. In future work, the problem of spectral feature extraction for speaker identification will be considered. Furthermore, the cases of background noise, music and distant speakers will be taken into account.
REFERENCES
[1] L. R. Rabiner and R. W. Schafer, Theory and Applications of Digital Speech Processing, Upper Saddle River, NJ, USA: Pearson, 2011.
[2] A. V. Oppenheim, Discrete-Time Signal Processing, New Delhi, India: Pearson Education India, 1999.
[3] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Englewood Cliffs, NJ, USA: Prentice-Hall, 1978.
[4] K. Gopalan, T. R. Anderson, and E. J. Cupples, "A comparison of speaker identification results using features based on cepstrum and Fourier-Bessel expansion," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 3, pp. 289-294, Apr. 1999.
[5] B. Imperl, Z. Kačič, and B. Horvat, "A study of harmonic features for the speaker recognition," Speech Communication, vol. 22, no. 4, pp. 385-402, Feb. 1997.
[6] G.-J. Jang, T.-W. Lee, and Y.-H. Oh, "Learning statistically efficient features for speaker recognition," Neurocomputing, vol. 49, no. 4, pp. 329-348, Jun. 2002.
[7] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12-40, Jan. 2010.
[8] G. Ruske and T. Schotola, "The efficiency of demisyllable segmentation in the recognition of spoken words," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'81), vol. 1, pp. 971-974, Apr. 1981.
[9] M. P. Kesarkar, "Feature extraction for speech recognition," M.Tech. Credit Seminar Report, Bombay, India, 2003.
[10] O.-W. Kwon, K. Chan, J. Hao, and T.-W. Lee, "Emotion recognition by speech signals," in Proc. European Conference on Speech Communication and Technology (Eurospeech), Geneva, Switzerland, Sep. 2003.
[11] K. Rakesh, S. Dutta, and K. Shama, "Gender recognition using speech processing techniques in LABVIEW," International Journal of Advances in Engineering & Technology, vol. 1, no. 2, p. 51, Jul. 2011.
[12] M. Hydari, M. R. Karami, and E. Nadernejad, "Speech signals enhancement using LPC analysis based on inverse Fourier methods," Contemporary Engineering Sciences, vol. 2, no. 1, pp. 1-15, Jan. 2009.
[13] T. Giannakopoulos, "Study and application of acoustic information for the detection of harmful content, and fusion with visual information," Ph.D. dissertation, Dept. of Informatics and Telecommunications, National and Kapodistrian University of Athens, Athens, Greece, 2009.