International Journal of Engineering Trends and Technology (IJETT) – Volume 33 Number 3 - March 2016

Speaker Verification under Degraded Condition

Sneha M. Powar, Dr. V. V. Patil
Electronics & Telecommunication Department, Dr. J. J. Magdum College of Engineering, Jaysingpur, Kolhapur, India
H.O.D., Electronics Department, Dr. J. J. Magdum College of Engineering, Jaysingpur, India

Abstract — Speaker verification is the process of accepting or rejecting the identity claim of a speaker by comparing a set of measurements of the speaker's utterances with a reference set of measurements of utterances of the person whose identity is claimed. Speaker verification (SV) systems perform well when the speech signal is clean, but their accuracy falls significantly under degraded conditions. In this work the performance of SV is analysed under degraded conditions by considering white Gaussian noise and babble noise. Because the proportion of noise in speech varies considerably across environments, the SV system is also analysed at different SNR levels of white Gaussian noise; as the SNR level increases, the performance of the SV system improves.

Keywords — VLR, VLROP, MFCC, GMM

I. INTRODUCTION

In our everyday lives there are many forms of communication, for instance body language, textual language, pictorial language and speech. Among these, speech is widely regarded as the most powerful form of communication. From the signal processing point of view, speech can be characterized as a signal carrying message information. Speaker verification (SV) is the process of determining whether a speaker is who the person claims to be. Terms with the same meaning as SV can be found in the literature, such as voice verification, voice authentication, speaker/talker authentication and talker verification.
It performs a one-to-one comparison (also called a binary decision) between the features of an input voice and those of the claimed voice registered in the system. The performance of the SV system can be further improved by selecting only those speech regions that, based on the nature of speech production, are relatively more speaker-discriminative and less affected by various degradations. This can be achieved using knowledge of vowel-like region (VLR) onset points (VLROPs). VLROPs help in identifying VLRs, which include vowels, semivowels and diphthongs; these are high-SNR regions from the production perspective and hence may be more speaker-discriminative.

State-of-the-art speaker verification (SV) systems perform well when the speech signal is of high quality and free from any mismatch. Such a signal is treated as clean speech in the present work. However, in most practical operating conditions the speech signal is affected by degradations such as background noise, reverberation, sensor mismatch and channel mismatch, resulting in degraded speech. In this paper we consider two types of noise, white Gaussian noise and babble noise, for the analysis of the SV system under degraded conditions.

II. DATABASE

For the speaker verification system we created a database in our own laboratory with the help of a microphone, headphones and a digital voice recorder. The database was developed in the Marathi language. Data were collected from 5 male and 5 female speakers, i.e. 10 different speakers in total. We recorded 20 clips per speaker, each of 1 minute duration, digitized at 16 bits. The total duration of the database is 3 hours 33 minutes. The Praat tool was used to create the database; it consists of an object window, a picture window and a sound editor window.
For Windows and Linux users the Praat menu item appears at the top left of the object window, while for Macintosh users this menu does not appear in the object window at all but is placed at the left of the menu bar at the top of the display.

III. METHODOLOGY

A. Speaker Verification

As described above, speaker verification performs a one-to-one comparison (a binary decision) between the features of an input voice and those of the claimed voice registered in the system. There are three main components: front-end processing, speaker modelling and pattern matching. Front-end processing highlights the relevant features and removes irrelevant ones; after this component we obtain the feature vectors of the speech signal. Pattern matching is then performed between the claimed-speaker model registered in the database and the impostor model, and if the match score is above a certain threshold the identity claim is verified. With a high threshold the system is safer and prevents impostors from being accepted, but it also risks rejecting the genuine person, and vice versa.

Fig 1 Basic structure of Speaker Verification

B. Feature Extraction

Front-end processing essentially means extracting the feature vectors. Extracting the best parametric representation of the acoustic signal is an important task for good recognition performance, and the efficiency of this phase matters for the next phase since it affects its behaviour.
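The accept/reject comparison described in Section III-A reduces to a score-versus-threshold test. A minimal sketch, with hypothetical log-likelihood scores standing in for the outputs of the claimed-speaker and impostor models:

```python
def accept_claim(log_p_claimed: float, log_p_impostor: float,
                 threshold: float) -> bool:
    """One-to-one verification decision: accept the identity claim iff
    the log-likelihood ratio between the claimed-speaker model and the
    impostor (background) model exceeds the operating threshold.

    A higher threshold admits fewer impostors but rejects more genuine
    speakers, and vice versa (the trade-off noted in the text).
    """
    llr = log_p_claimed - log_p_impostor
    return llr >= threshold

# Hypothetical scores for illustration only:
print(accept_claim(-10.0, -15.0, threshold=2.0))  # True  (LLR = 5)
print(accept_claim(-15.0, -10.0, threshold=2.0))  # False (LLR = -5)
```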
MFCC is based on human hearing perception: it follows the known variation of the human ear's critical bandwidth with frequency. The MFCC filter bank uses filters spaced linearly at low frequencies, below 1000 Hz, and logarithmically above 1000 Hz. A subjective pitch scale, the mel scale, is used to capture the important phonetic characteristics of speech. The overall process of MFCC computation is shown in Fig 2.

Fig 2 MFCC Computation Block Diagram

MFCC computation consists of seven steps, each with its own function and mathematical approach:

1. Pre-emphasis: the signal is passed through a filter that emphasizes higher frequencies, increasing the energy of the signal at high frequencies.
2. Framing: the speech samples obtained from analog-to-digital conversion (ADC) are segmented into small frames with a length in the range of 20 to 40 ms.
3. Hamming windowing: a Hamming window is applied to each frame, chosen with the next block of the feature-extraction chain in mind, to integrate the closest frequency lines.
4. Fast Fourier Transform: each frame of N samples is converted from the time domain to the frequency domain.
5. Mel filter bank processing: the frequency range of the FFT spectrum is very wide and the voice signal does not follow a linear scale, so the spectrum is warped onto the mel scale.
6. Discrete Cosine Transform: the log mel spectrum is converted back to the time domain using the DCT. The result is the mel-frequency cepstral coefficients; the set of coefficients is called an acoustic vector.
7. Delta energy and delta spectrum: the voice signal and the frames change over time, for example the slope of a formant at its transitions, so there is a need to add features describing the change in cepstral features over time.
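Steps 1–6 above can be sketched in numpy/scipy as follows. This is a minimal illustration, not the paper's implementation; the sampling rate, frame sizes, filter count and pre-emphasis coefficient are common default choices assumed here.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, fs=16000, n_filters=26, n_ceps=12,
         frame_ms=25, hop_ms=10, alpha=0.97):
    """Minimal MFCC sketch (steps 1-6); parameters are common defaults."""
    # 1. Pre-emphasis: boost higher frequencies.
    sig = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # 2. Framing: 25 ms frames with a 10 ms hop.
    flen, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    n_frames = 1 + (len(sig) - flen) // hop
    frames = np.stack([sig[i * hop: i * hop + flen] for i in range(n_frames)])
    # 3. Hamming window to reduce spectral leakage.
    frames = frames * np.hamming(flen)
    # 4. Power spectrum via FFT.
    nfft = 512
    pow_spec = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # 5. Mel filter bank: triangular filters, linear below ~1 kHz and
    #    logarithmic above, implemented via the mel scale.
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    mel_pts = np.linspace(mel(0), mel(fs / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * imel(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for j in range(1, n_filters + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    fb_energy = np.log(pow_spec @ fbank.T + 1e-10)
    # 6. DCT of log filter-bank energies -> cepstral coefficients.
    return dct(fb_energy, type=2, axis=1, norm='ortho')[:, 1:n_ceps + 1]

coeffs = mfcc(np.random.randn(16000))   # 1 s of dummy audio
print(coeffs.shape)                     # -> (98, 12)
```

The delta and double-delta features described in step 7 would be appended to these 12 coefficients (plus energy) frame by frame.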
To the 13 base features (12 cepstral coefficients plus energy), 13 delta (velocity) features and 13 double-delta (acceleration) features are added, giving 39 features in total.

C. Speaker Modelling

For modelling we use the Gaussian mixture model (GMM). A GMM is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in biometric systems, such as vocal-tract-related spectral features in a speaker recognition system. GMM parameters are estimated from training data using the iterative expectation-maximization (EM) algorithm, or by maximum a posteriori (MAP) estimation from a well-trained prior model. GMMs are often used in verification systems, most notably speaker recognition systems, because they can represent a large class of sample distributions. One of the powerful attributes of the GMM is its ability to form smooth approximations to arbitrarily shaped densities. The classical unimodal Gaussian model represents a feature distribution by a position (mean vector) and an elliptic shape (covariance matrix), while a vector quantizer (VQ) or nearest-neighbour model represents a distribution by a discrete set of characteristic templates. A GMM acts as a hybrid between these two models, using a discrete set of Gaussian functions, each with its own mean and covariance matrix, to allow better modelling capability. The GMM not only provides a smooth overall distribution fit; its components also clearly capture the multi-modal nature of the density.

IV. RESULTS

We tested the SV system by adding two types of noise, white Gaussian noise and multi-talker babble noise, and evaluated it at different SNR values of the white Gaussian noise.
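The two building blocks of these experiments, mixing noise into clean speech at a target SNR and GMM enrolment/scoring as described in Section III-C, can be sketched with numpy and scikit-learn (assumed available). The speaker names, component count and synthetic "features" below are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def add_awgn(clean, snr_db):
    """Mix white Gaussian noise into a clean signal at a target SNR (dB)."""
    p_sig = np.mean(clean ** 2)
    p_noise = p_sig / (10 ** (snr_db / 10))
    return clean + rng.normal(0.0, np.sqrt(p_noise), clean.shape)

# Toy feature vectors for two speakers (stand-ins for real MFCC frames).
feats = {s: rng.normal(loc=mu, scale=1.0, size=(500, 12))
         for s, mu in (("spk1", 0.0), ("spk2", 3.0))}

# Enrolment: one GMM per speaker, trained with EM as described above.
models = {s: GaussianMixture(n_components=4, covariance_type='diag',
                             random_state=0).fit(x)
          for s, x in feats.items()}

# Verification: average log-likelihood of test frames under each model.
test = rng.normal(0.0, 1.0, (200, 12))   # an utterance from spk1
score1 = models["spk1"].score(test)      # higher = better match
score2 = models["spk2"].score(test)
print(score1 > score2)                   # True: spk1's model fits best
```

In the actual experiments the noisy data produced this way are used for both training and testing, and the accept/reject decision compares the claimed-speaker score against a threshold.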
The following TABLE I shows the average speaker accuracy at different SNR values.

TABLE I AVERAGE ACCURACY IN PERCENTAGE

SNR value   Accuracy (%)
20 dB       61.7
25 dB       65.9
30 dB       69.3
35 dB       72.5
40 dB       75.5

The following figure shows the graph of SNR vs. speaker accuracy in percentage; as the SNR value increases, the performance of the SV system also increases.

A. Analysis with White Gaussian Noise

We analysed the performance of the SV system after adding white Gaussian noise at SNR = 30 dB, using noisy data for both training and testing, and drew the confusion matrix showing each speaker's accuracy. Speaker 6, a female speaker, shows the highest accuracy, 77.0%, as shown in TABLE II. Speaker 3, also a female speaker, shows 62.8% accuracy and is mostly confused with speaker 2, another female speaker; the confusion is similar in the reverse direction. Speaker 1 gives 65.7% accuracy and is mostly confused with speaker 5; both are male speakers, and again the confusion is similar in reverse. In short, these observations show that the maximum confusion occurs between speakers of the same gender. TABLE II also shows that accuracy decreases when the data are noisy.

B. Analysis with Babble Noise

We analysed the performance of the SV system after adding babble noise and again drew the confusion matrix of speaker accuracies. In this case speaker 9, a male speaker, shows the highest accuracy, 79.1%, as shown in TABLE III. Speaker 3, a female speaker, shows 71.2% accuracy and is mostly confused with speaker 6, also a female speaker; the confusion is similar in reverse. Speaker 1 gives 75.3% accuracy and is mostly confused with speaker 5; both are male speakers, with similar confusion in reverse.
Speaker 7 gives 76.4% accuracy and is mostly confused with speaker 10; both are male speakers, with similar confusion in reverse. In short, these observations again show that the maximum confusion occurs between speakers of the same gender, and TABLE III shows that the SV system's accuracy decreases when the data are noisy.

Fig 3 Graph of SNR vs. Identification Accuracy

TABLE II CONFUSION MATRIX (%) WITH WHITE GAUSSIAN NOISE, SNR = 30 dB
(rows: actual speaker; columns: speaker read as)

        SPK1  SPK2  SPK3  SPK4  SPK5  SPK6  SPK7  SPK8  SPK9  SPK10
SPK1    65.7   6.2   4.4   3.2  11.4   0.1   3.2   0.5   2.2   2.7
SPK2     7.6  67.0   7.4   6.6   4.3   0.8   1.1   2.1   1.6   1.0
SPK3     8.0   8.1  62.8   5.9   4.7   1.1   1.6   2.7   3.2   1.4
SPK4     8.2   9.2   8.1  60.4   4.7   0.9   1.5   2.4   2.3   1.3
SPK5    11.6   5.2   4.3   2.9  67.1   0.1   3.4   0.5   1.7   2.8
SPK6     2.1   6.7   3.5   2.5   1.2  77.0   0.7   2.7   2.3   0.8
SPK7     6.8   3.7   3.0   2.2   5.4   0.4  72.3   0.6   2.2   3.3
SPK8     3.4   6.5   5.7   4.3   2.0   1.8   0.7  73.5   1.0   0.6
SPK9     5.3   4.4   3.8   2.4   2.8   0.6   1.9   0.6  76.4   1.4
SPK10    6.4   4.3   3.2   2.3   5.0   0.5   3.5   0.6   2.3  71.4

TABLE III CONFUSION MATRIX (%) WITH BABBLE NOISE
(rows: actual speaker; columns: speaker read as; per-speaker accuracies on the diagonal)

V.
DISCUSSION

Speaker verification is the process of accepting or rejecting the identity claim of a speaker by comparing a set of measurements of the speaker's utterances with a reference set of measurements of utterances of the person whose identity is claimed. In speaker verification, a person makes an identity claim. The overall finding of this work is that the performance of the SV system falls under degraded conditions. MFCC gives good performance here, recognizing all speakers effectively. When noise is added to the clean data, the performance of the SV system decreases: it shows 69.39% accuracy at an SNR of 30 dB, and as the SNR level increases the performance of the SV system increases. We also tested the performance of the system by adding another type of noise, babble noise, which gives 74.23% accuracy. The performance of the SV system is better with babble noise than with white Gaussian noise because of the characteristics of the noises.

VI. FUTURE SCOPE

The research presented in this work focused on the speaker verification task, i.e. accepting or rejecting the identity claim of a speaker by comparing a set of measurements of the speaker's utterances with a reference set of measurements of utterances of the claimed identity. In this work only the MFCC feature was used; other features, such as the LP residual and pitch, could be computed for the development of speaker verification. Here the SV system's performance was analysed using AWGN and babble noise; it could also be analysed using Factory-I noise, pink noise and impulsive noise.

REFERENCES

[1] P. Krishnamoorthy and S. R. M. Prasanna, "Enhancement of noisy speech by temporal and spectral processing," Speech Commun., vol. 53, pp. 154–174, Feb. 2011.
[2] B. Yegnanarayana and P. S. Murthy, "Enhancement of reverberant speech using LP residual signal," IEEE Trans. Speech Audio Process., vol. 8, no. 3, pp. 267–281, May 2000.
[3] P. Krishnamoorthy and S. R. M. Prasanna, "Reverberant speech enhancement by temporal and spectral processing," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 2, pp. 253–266, Feb. 2009.
[4] A. N. Khan and B. Yegnanarayana, "Vowel onset point based variable frame rate analysis for speech recognition," in Proc. Int. Conf. Intell. Sensing Inf. Process., Jan. 2005, pp. 392–394.
[5] X. Anguera Miro, "Robust speaker diarization for meetings," Speech Processing Group, Dept. of Signal Theory and Communications, Universitat Politècnica de Catalunya, Barcelona, 2006.
[6] B. Miller, "Vital signs of identity," IEEE Spectrum, pp. 22–30, Feb. 1994.
[7] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Commun., vol. 52, pp. 12–40, Jan. 2010.
[8] J. Ming, T. J. Hazen, J. R. Glass, and D. A. Reynolds, "Robust speaker recognition in noisy conditions," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 5, pp. 1711–1723, Jul. 2007.