International Journal of Engineering Trends and Technology (IJETT) – Volume 33 Number 3- March 2016
Speaker Verification under Degraded
Condition
Sneha M.Powar, Dr. V.V.Patil
Electronics & Telecommunication Department, Dr. J. J. Magdum College of Engineering, Jaysingpur,
Kolhapur, India
H.O.D., Electronics Department, Dr. J. J. Magdum College of Engineering, Jaysingpur, India
Abstract — Speaker verification is a process to accept or reject the identity claim of a speaker by comparing a set of measurements of the speaker's utterances with a reference set of measurements of the utterances of the person whose identity is claimed. Speaker verification (SV) systems provide good performance when the speech signal is clean. In this work the performance of SV is analysed under degraded conditions by considering White Gaussian Noise and Babble Noise. The proportion of noise in clean speech varies considerably under different environmental conditions; hence, the SV system is also analysed at different SNR levels of White Gaussian Noise. The accuracy of the SV system falls significantly under degraded conditions, and as the SNR level increases the performance of the SV system increases.
Keywords- VLR, VLROP, MFCC, GMM
I. INTRODUCTION
In our everyday lives there are many forms of communication, for instance body language, textual language, pictorial language, and speech. Amongst these forms, speech is generally regarded as the most powerful, and from the signal processing point of view it can be characterized as a signal carrying message information. Speaker verification (SV) is the process of determining whether a speaker's identity is who the person claims to be. Terms with the same meaning as SV can be found in the literature, such as voice verification, voice authentication, speaker/talker authentication, and talker verification. SV performs a one-to-one comparison (also called a binary decision) between the features of an input voice and those of the claimed voice registered in the system. The performance of an SV system can be further improved by selecting only those speech regions that, owing to the nature of speech production, are relatively more speaker discriminative and less affected by various degradations. This can be achieved using the knowledge of vowel-like region (VLR) onset points (VLROPs). A VLROP helps in identifying VLRs,
which include vowels, semivowels, and diphthongs. These are high-SNR regions from the production perspective and hence may be more speaker discriminative. State-of-the-art speaker verification (SV) systems provide good performance when the speech signal is of high quality and free from any mismatch; such a speech signal is treated as clean speech in the present work. However, in most practical operating conditions the speech signal is affected by different degradations, such as background noise, reverberation, sensor mismatch, and channel mismatch, resulting in degraded speech. In this paper we consider two types of noise, White Gaussian Noise and Babble Noise, for the analysis of the SV system under degraded conditions.
II. DATABASE
For Speaker verification system we have created a
database in our own laboratory with the help of
microphone, headphone & digital voice recorder.
We developed the database in „Marathi‟ language.
For creation of database we collect data from 5 male
and 5 female speakers and total data collection is of
10 different speakers. We record 20 clips of each
speaker and each clip is of 1 minute duration. The
digitization of the recorded wave is 16 bit. The total
duration of database is of 3 hour 33 min. For
creating a database the Pratt tool is used. This tool
consists of object window, picture window & sound
editor window. For Windows and Linux users the
Pratt menu item appears at the top left position in the
object window while for Macintosh users this menu
item does not appear in the object window at all but
is placed left on the menu at the top of the display.
III. METHODOLOGY
A. Speaker Verification
Speaker verification (SV) is the process of determining whether a speaker's identity is who the person claims to be. Terms with the same meaning as SV can be found in the literature, such as voice verification, voice authentication, speaker/talker authentication, and talker verification. SV performs a one-to-one comparison (also called a binary decision) between the features of an input voice and those of the claimed voice that is
registered in the system. There are three main components: front-end processing, speaker modelling, and pattern matching. Front-end processing highlights the relevant features and removes the irrelevant ones; its output is the set of feature vectors of the speech signal. Pattern matching is then performed between the claimed speaker model registered in the database and the imposter model, and if the match is above a certain threshold the identity claim is verified. With a high threshold the system is safer and prevents impostors from being accepted, but at the same time it risks rejecting the genuine person, and vice versa.
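As a minimal sketch of this decision rule (not the paper's implementation), the claimed-speaker and imposter models are assumed to expose a score_samples method returning per-frame log-likelihoods, as sklearn's GaussianMixture does:

import numpy as np

def verify_claim(features, claimed_model, imposter_model, threshold=0.0):
    """Return True if the identity claim is accepted, False otherwise.

    features: (n_frames, n_dims) array of feature vectors.
    Both models are assumed to provide score_samples(), giving
    per-frame log-likelihoods (e.g. sklearn GaussianMixture).
    """
    # Average log-likelihood ratio over all frames: claimed vs imposter
    llr = np.mean(claimed_model.score_samples(features)
                  - imposter_model.score_samples(features))
    # A higher threshold rejects more impostors but risks rejecting
    # the genuine speaker, as described above.
    return llr > threshold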
Fig 1 Basic structure of Speaker Verification
B. Feature Extraction
Front-end processing is essentially the extraction of the feature vectors. Extracting the best parametric representation of the acoustic signal is an important task for good recognition performance, and the efficiency of this phase matters for the next phase since it affects its behaviour. MFCC is based on human hearing perception, which does not resolve frequencies on a linear scale above 1 kHz; in other words, MFCC is based on the known variation of the human ear's critical bandwidth with frequency. MFCC uses filters spaced linearly at low frequencies, below 1000 Hz, and logarithmically above 1000 Hz. A subjective pitch is defined on the Mel frequency scale to capture the important phonetic characteristics of speech. The overall process of MFCC computation is shown in Fig. 2.
Fig 2 MFCC Computation Block Diagram
MFCC computation consists of seven steps, each with its own function and mathematical approach.
1 Pre-emphasis
In this step the signal is passed through a filter that emphasizes the higher frequencies, increasing the energy of the signal at high frequency.
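A minimal sketch of this step, assuming the common first-order filter y[n] = x[n] - a*x[n-1]; the coefficient a = 0.97 is an assumption, since the paper does not state one:

import numpy as np

def pre_emphasis(signal, alpha=0.97):
    # First-order high-pass: y[n] = x[n] - alpha * x[n-1],
    # which boosts the high-frequency content of the signal
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])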
2 Framing
The speech samples obtained from analog-to-digital conversion (ADC) are segmented into small frames with lengths in the range of 20 to 40 ms.
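A sketch of the segmentation, using 25 ms frames with a 10 ms hop as illustrative values inside the stated 20 to 40 ms range:

import numpy as np

def frame_signal(signal, fs, frame_ms=25, hop_ms=10):
    # Slice the sampled signal into overlapping frames
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop:i * hop + frame_len]
                     for i in range(n_frames)])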
3 Hamming windowing
A Hamming window is applied to each frame; it tapers the frame edges, reducing spectral leakage in the next block of the feature extraction chain by integrating the closest frequency lines.
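The window itself is w[n] = 0.54 - 0.46 cos(2*pi*n/(N-1)); applying it per frame is a one-liner:

import numpy as np

def apply_hamming(frames):
    # Taper each frame (rows of `frames`) to reduce spectral leakage
    return frames * np.hamming(frames.shape[1])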
4 Fast Fourier Transform
The FFT converts each frame of N samples from the time domain into the frequency domain.
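A sketch of this step, computing the power spectrum of each frame; the 512-point FFT size is an illustrative choice:

import numpy as np

def power_spectrum(frames, n_fft=512):
    # Magnitude-squared spectrum (periodogram) of each windowed frame
    return (np.abs(np.fft.rfft(frames, n=n_fft)) ** 2) / n_fft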
5 Mel Filter Bank Processing
The frequency range of the FFT spectrum is very wide, and the voice signal does not follow a linear frequency scale; the power spectrum is therefore warped onto the Mel scale, mel(f) = 2595 log10(1 + f/700), using a bank of triangular filters.
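A sketch of building such a triangular filter bank; the filter count of 26 is an assumption, since the paper does not report one:

import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, fs=16000):
    # mel(f) = 2595 * log10(1 + f / 700) and its inverse
    to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Filter edges equally spaced in Mel, mapped back to FFT bins
    edges = to_hz(np.linspace(0.0, to_mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):      # rising slope
            fbank[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling slope
            fbank[i, k] = (right - k) / max(right - center, 1)
    return fbank

The bank is applied to the power spectrum and the result is compressed logarithmically, e.g. log_mel = np.log(power_spec @ fbank.T + 1e-10).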
6 Discrete Cosine Transform
In this step the log Mel spectrum is converted back into the time domain using the Discrete Cosine Transform (DCT). The results of the conversion are called Mel Frequency Cepstral Coefficients, and the set of coefficients is called an acoustic vector.
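A sketch of this step, keeping the first 12 coefficients to match the feature counts used in the next step:

import numpy as np
from scipy.fftpack import dct

def cepstral_coefficients(log_mel, n_ceps=12):
    # DCT-II decorrelates the log Mel filter-bank energies;
    # the lowest-order coefficients form the MFCC acoustic vector
    return dct(log_mel, type=2, axis=-1, norm='ortho')[..., :n_ceps]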
7 Delta Energy and Delta Spectrum
The voice signal and the frames change over time, for example in the slope of a formant at its transitions. Therefore, features related to the change in cepstral features over time are added: 13 delta (velocity) features computed from the 12 cepstral features plus energy, and 13 double-delta (acceleration) features, giving 39 features in total.
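A sketch of the standard regression formula d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2 * sum_n n^2) over a window of +/- 2 frames; stacking the static, delta, and double-delta features yields the 39-dimensional vector:

import numpy as np

def delta(feats, N=2):
    # Regression-based delta over a window of +/- N frames
    padded = np.pad(feats, ((N, N), (0, 0)), mode='edge')
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    rows = [sum(n * (padded[t + N + n] - padded[t + N - n])
                for n in range(1, N + 1)) / denom
            for t in range(feats.shape[0])]
    return np.stack(rows)

# 13 static features (12 cepstral + energy) -> 39 in total:
# full = np.hstack([static, delta(static), delta(delta(static))])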
C. Speaker Modelling
For modelling we use the Gaussian Mixture Model (GMM). A GMM is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in a biometric system, such as vocal-tract-related spectral features in a speaker recognition system. GMM parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm or by Maximum A Posteriori (MAP) estimation from a well-trained prior model. GMMs are often used in verification systems, most notably in speaker recognition, because of their capability to represent a large class of sample distributions. One of the powerful attributes of the GMM is its ability to form smooth approximations to arbitrarily shaped densities. The classical uni-modal Gaussian
model represents a feature distribution by a position (mean vector) and an elliptic shape (covariance matrix), while a vector quantizer (VQ) or nearest-neighbour model represents a distribution by a discrete set of characteristic templates. A GMM acts as a hybrid between these two models by using a discrete set of Gaussian functions, each with its own mean and covariance matrix, to allow better modelling capability. The GMM not only provides a smooth overall distribution fit; its components also clearly detail the multi-modal nature of the density.
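A sketch of per-speaker EM training with scikit-learn; the mixture size and covariance type are assumptions, since the paper does not report its GMM configuration:

from sklearn.mixture import GaussianMixture

def train_speaker_gmm(mfcc_frames, n_components=32):
    # EM estimation of component weights, means, and (diagonal) covariances
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='diag',
                          max_iter=200, random_state=0)
    gmm.fit(mfcc_frames)   # mfcc_frames: (n_frames, n_dims)
    return gmm

A test utterance can then be scored with gmm.score(test_frames), which returns the average per-frame log-likelihood under that speaker's model.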
IV. RESULTS
We tested the SV system by adding two types of noise: White Gaussian Noise and multitalker Babble Noise. We also tested the SV system at different SNR values of the White Gaussian Noise. TABLE I shows the average speaker accuracy at different SNR values.
TABLE I
AVERAGE ACCURACY IN PERCENTAGE

SNR value    Accuracy (%)
20 dB        61.7
25 dB        65.9
30 dB        69.3
35 dB        72.5
40 dB        75.5
Fig. 3 shows the graph of SNR vs. speaker accuracy in percentage. As the SNR value increases, the performance of the SV system also increases.
A. Analysis with White Gaussian Noise
We analysed the performance of the SV system by adding White Gaussian Noise at SNR = 30 dB.
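A sketch of how white Gaussian noise can be scaled and added to clean speech at a chosen SNR, following SNR_dB = 10*log10(P_signal/P_noise):

import numpy as np

def add_wgn(clean, snr_db):
    # Scale unit-variance Gaussian noise to reach the requested SNR
    noise = np.random.randn(len(clean))
    p_signal = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise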
We also draw the confusion matrix, which shows the percentage accuracy per speaker. In this case noisy data is used in both training and testing. Speaker 6, a female speaker, shows the highest accuracy, 77.0%, as shown in TABLE II. Speaker 3, a female speaker, shows 62.8% accuracy and is mostly confused with speaker 2, also a female speaker; the confusion is similar in the reverse direction. Speaker 1 gives 65.7% accuracy and is mostly confused with speaker 5; both are male speakers, and again the confusion is similar in reverse. In short, these observations show that the maximum confusion occurs between speakers of the same gender. TABLE II also shows that accuracy decreases when the data is noisy.
B. Analysis with Babble Noise
We analysed the performance of the SV system by adding Babble Noise and again drew the confusion matrix showing the percentage accuracy per speaker. In this case speaker 9, a male speaker, shows the highest accuracy, 79.1%, as shown in TABLE III. Speaker 3, a female speaker, shows 71.2% accuracy and is mostly confused with speaker 6, also a female speaker; the confusion is similar in the reverse direction. Speaker 1 gives 75.3% accuracy and is mostly confused with speaker 5; both are male speakers, and the confusion is similar in reverse. Speaker 7 gives 76.4% accuracy and is mostly confused with speaker 10; both are male speakers, and the confusion is similar in reverse. In short, these observations show that the maximum confusion occurs between speakers of the same gender. TABLE III also shows that the SV system's accuracy decreases when the data is noisy.
Fig 3 Graph of SNR vs. Identification Accuracy
TABLE II
CONFUSION MATRIX (%) WITH WHITE GAUSSIAN NOISE AT SNR = 30 dB
(rows: actual speaker; columns: speaker read as)

         SPK1  SPK2  SPK3  SPK4  SPK5  SPK6  SPK7  SPK8  SPK9  SPK10
SPK1     65.7   6.2   4.4   3.2  11.4   0.1   3.2   0.5   2.2   2.7
SPK2      7.6  67.0   7.4   6.6   4.3   0.8   1.1   2.1   1.6   1.0
SPK3      8.0   8.1  62.8   5.9   4.7   1.1   1.6   2.7   3.2   1.4
SPK4      8.2   9.2   8.1  60.4   4.7   0.9   1.5   2.4   2.3   1.3
SPK5     11.6   5.2   4.3   2.9  67.1   0.1   3.4   0.5   1.7   2.8
SPK6      2.1   6.7   3.5   2.5   1.2  77.0   0.7   2.7   2.3   0.8
SPK7      6.8   3.7   3.0   2.2   5.4   0.4  72.3   0.6   2.2   3.3
SPK8      3.4   6.5   5.7   4.3   2.0   1.8   0.7  73.5   1.0   0.6
SPK9      5.3   4.4   3.8   2.4   2.8   0.6   1.9   0.6  76.4   1.4
SPK10     6.4   4.3   3.2   2.3   5.0   0.5   3.5   0.6   2.3  71.4
TABLE III
CONFUSION MATRIX (%) WITH BABBLE NOISE
[The full 10 x 10 matrix is garbled in the source; only the diagonal, i.e. each speaker's accuracy in %, is reliably recoverable: SPK1 71.3, SPK2 71.8, SPK3 71.2, SPK4 70.6, SPK5 76.4, SPK6 71.2, SPK7 76.4, SPK8 77.9, SPK9 79.1, SPK10 75.9.]
V. DISCUSSION
Speaker verification is a process to accept or reject the identity claim of a speaker by comparing a set of measurements of the speaker's utterances with a reference set of measurements of the utterances of the person whose identity is claimed. In speaker verification, a person makes an identity claim. The overall aim of this work was to show how the performance of the SV system falls under degraded conditions. MFCC gives good performance here and recognizes all speakers effectively. When noise is added to the clean data, the performance of the SV system decreases: with White Gaussian Noise at an SNR of 30 dB it shows 69.39% accuracy, and as the SNR level increases the performance of the SV system increases. We also tested the system with another type of noise, Babble Noise, which gives 74.23% accuracy. The SV system performs better with Babble Noise than with White Gaussian Noise because of the characteristics of the two noises.
VI. FUTURE SCOPE
The research presented in this work focused on the speaker verification task, which accepts or rejects the identity claim of a speaker by comparing a set of measurements of the speaker's utterances with a reference set of measurements of the utterances of the person whose identity is claimed.
In this work only the MFCC feature is used; other features, such as the LP residual and pitch, could be computed for the development of speaker verification. In this research the SV system performance is analysed using AWGN and Babble Noise; the performance could also be analysed using Factory-I noise, Pink noise, and Impulsive noise.
REFERENCES
[1] P. Krishnamoorthy and S. R. M. Prasanna, “Enhancement of
noisy speech by temporal and spectral processing,” Speech
Commun., vol. 53, pp. 154–174, Feb. 2011.
[2] B. Yegnanarayana and P. S. Murthy, “Enhancement of
reverberant speech using LP residual signal,” IEEE Trans. Speech
Audio Process., vol. 8, no. 3, pp. 267–281, May 2000.
[3] P. Krishnamoorthy and S. R. M. Prasanna, “Reverberant
speech enhancement by temporal and spectral processing,” IEEE
Trans. Audio, Speech, Lang. Process., vol. 17, no. 2, pp. 253–266,
Feb. 2009.
[4] A. N. Khan and B. Yegnanarayana,“Vowel onset point based
variable frame rate analysis for speech recognition,” in Proc. Int.
Conf. Intell. Sensing Inf. Process., Jan. 2005, pp. 392–394.
[5] X. Anguera Miro, "Robust speaker diarization for meetings," Speech Processing Group, Dept. of Signal Theory and Communications, Universitat Politècnica de Catalunya, Barcelona, 2006.
[6] B. Miller, "Vital signs of identity," IEEE Spectrum, pp. 22–30, Feb. 1994.
[7] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Commun., vol. 52, pp. 12–40, Jan. 2010.
[8] J. Ming, T. J. Hazen, J. R. Glass, and D. A. Reynolds, "Robust speaker recognition in noisy conditions," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 5, pp. 1711–1723, Jul. 2007.