SIM UNIVERSITY SCHOOL OF SCIENCE AND TECHNOLOGY VOICE PRINT ANALYSIS FOR SPEAKER RECOGNITION STUDENT SUPERVISOR PROJECT CODE : THANG WEE KEONG (M0606042) : DR YUAN ZHONG XUAN : JAN09/BEHE/57 A project report submitted to SIM University in partial fulfilment of the requirements for the degree of Bachelor of Engineering of Electronics NOV 2009 Acknowledgement I would like to express my heartfelt gratitude to my project supervisor, Dr Yuan Zhong Xuan for his patient guidance, invaluable advice and supervision. I would like to say thanks to him for being so accommodating to adjust his busy schedule to meet me. Dr Yuan has selflessly imparted his expertise and years of experience to me throughout the whole duration of the Final Year Project. Without his help, I would not have been able to complete my project. I would also like to extend my gratitude to all my UniSIM lecturers that have taught me throughout my university days for the knowledge and guidance imparted to me. Without them, the educational experience in UniSIM would not have been so enjoyable and enriching. I would also like to express warm appreciation to my colleagues and friends who have been so accommodating and helpful in my pursue for academic excellence. Last but not least, I would like to extend my heartfelt thanks to my family and my partner for their unconditional love, accommodation and constant moral support throughout my academic years. ENG499 CAPSTONE PROJECT REPORT i Abstract A person's voice contains various parameters that convey information such as emotion, gender, attitude, health and identity. This thesis talks about speaker recognition which deals with the subject of identifying a person based on their unique voiceprint present in their speech data. Pre-processing of the speech signal is performed before voice feature extraction. This process ensures the voice feature extraction contains accurate information that conveys the identity of the speaker. Voice feature extraction methods such as Linear Predictive Coding (LPC), Linear Predictive Cepstral Coefficients (LPCC) and MelFrequency Cepstral Coefficients (MFCC) are analysed and evaluated for their suitability for use in speaker recognition tasks. A new method which combined LPCC and MFCC (LPCC+MFCC) using fusion output was proposed and evaluated together with the different voice feature extraction methods. The speaker model for all the methods was computed using Vector Quantization- Linde, Buzo and Gray (VQ-LBG) method. Individual modelling and comparison for LPCC and MFCC is used for the LPCC+MFCC method. The similarity scores for both methods are then combined for identification decision. The results show that this method is better or at least comparable to the traditional methods such as LPCC and MFCC without incurring high computation costs that will compromise the performance of the speaker recognition tasks. ENG499 CAPSTONE PROJECT REPORT ii Contents Acknowledgement ................................................................................................................. i Abstract .................................................................................................................................ii List of Figures ....................................................................................................................... v List of Tables ......................................................................................................................vii 1. 
Introduction ................................................................................................................... 1 1.1 Development of speaker recognition systems ........................................................... 2 1.2 Project Objectives ...................................................................................................... 3 1.3 Project Scope ............................................................................................................. 3 1.4 Summary of Report.................................................................................................... 4 2. Literature review ........................................................................................................... 5 2.1 Concepts of speaker recognition ................................................................................ 6 2.2 Identification and Verification tasks .......................................................................... 6 2.2.1 Text dependent and Text independent Task .............................................................. 6 2.2.2 Open and closed-set Identification ............................................................................ 7 2.3 Phases of Speaker Identification ................................................................................ 7 2.4 Pre-processing techniques ......................................................................................... 8 2.4.1 Analogue-To-Digital (A/D) ....................................................................................... 8 2.4.2 End-Point Detection................................................................................................... 9 2.4.3 Pre-Emphasis ........................................................................................................... 10 2.4.4 Speech analysis technique or framing ..................................................................... 10 2.5 Voice features extraction ........................................................................................ 11 2.5.1 Linear Predictive Coding (LPC) .............................................................................. 11 2.5.2 Linear Predictive Cepstral Coefficients ................................................................... 13 2.5.3 Mel-Frequency Cepstral Coefficients (MFCC) ....................................................... 14 2.6 Summary of feature extraction techniques .............................................................. 16 2.7 Speaker Modeling .................................................................................................... 17 2.7.1 Template Matching .................................................................................................. 17 2.7.1.2 Dynamic Time Wrapping (DTW) ........................................................................ 17 2.7.2 Vector Quantization source modelling .................................................................... 18 2.7.3 Stochastic Models .................................................................................................... 21 2.7.3.1 Hidden Markov Model ......................................................................................... 21 2.8 Neural Networks ...................................................................................................... 22 2.9 Summary .................................................................................................................. 23 3. 3.1 4. 
Project Plan.................................................................................................................. 25 Gantt Chart............................................................................................................... 28 Development of the speaker recognition system ......................................................... 30 ENG499 CAPSTONE PROJECT REPORT iii 4.1 DC Offset Removal ................................................................................................. 31 4.2 End-point detection .................................................................................................. 31 4.3 Pre-Emphasis ........................................................................................................... 32 4.4 Framing/Windowing ................................................................................................ 34 4.5 Feature extraction .................................................................................................... 34 4.5.1 Linear Prediction Coefficients (LPC) ...................................................................... 35 4.5.2 Linear Prediction Cepstral Coefficients (LPCC) ..................................................... 35 4.5.3 Mel-Frequency Cepstral Coefficients (MFCC) ....................................................... 37 4.6 Speaker modelling ................................................................................................... 38 5. Experimental Setup ..................................................................................................... 39 6. Results ......................................................................................................................... 40 6.1 Linear Predictive Coefficients ................................................................................. 40 6.1.2 Conclusion of LPC .................................................................................................. 42 6.2 Linear Predictive Cepstral Coefficients ................................................................... 43 6.2.1 Conclusion of LPCC ................................................................................................ 45 6.3 Mel-Frequency Cepstral Coefficients ...................................................................... 46 6.3.1 Conclusion of MFCC ............................................................................................... 48 6.4 LPCC+MFCC .......................................................................................................... 50 6.4.2 Comparison of LPCC+MFCC vs Other Methods ................................................... 51 7. 7.1 8. Conclusion ................................................................................................................... 53 Recommendations for further study ........................................................................ 54 Critical review and reflections .................................................................................... 55 Appendix .............................................................................................................................60 A.1 Identification rates using codebook size of 32.......................................................... 
60 A.2 Identification rates using codebook size of 64...........................................................60 A.3 Identification rates using codebook size of 128.........................................................61 A.4 Identification rates using MFCC+LPCC (Codebook size 32)....................................61 A.5 FYP.fig ..................................................................................................................... 62 A.6 Feature_selection.fig .................................................................................................62 A.7 User_Identified.fig ....................................................................................................63 A.8 Voice_recording_FYP.fig .........................................................................................63 ENG499 CAPSTONE PROJECT REPORT iv List of Figures 2.1 Speaker recognition processing tree...........................................................................6 2.2 Block diagram of automatic speaker identification system........................................8 2.3 Block diagram of the pre-processing stages...............................................................8 2.4 Speech Analysis Filter................................................................................................12 2.5 Speech Synthesis Filter..............................................................................................12 2.6 Block diagram of Linear Predictive Cepstral Coefficient..........................................13 2.7 Mel Scale plot.............................................................................................................14 2.8 Block diagram of Mel-Frequency Cepstral Coefficient.............................................15 2.9 Block diagram of speaker decision............................................................................17 2.10 2-dimensional VQ with 32 regions............................................................................18 2.11 K-means clusters........................................................................................................19 2.12 Steps for LBG algorithm............................................................................................20 2.13 Probabilistic parameters of a hidden Markov............................................................21 4.1 Flowchart of prototype speaker recognition system...................................................30 4.2 Flowchart of end-point detection for prototype speaker recognition system.............31 4.3 Speech data before and after End-Point Detection....................................................32 4.4 Frequency response and z-plane plot of the pre-emphasis filter with = 0.95.........33 4.5 Speech data before and after End-Point Detection....................................................33 4.6 Power spectrum before and after End-Point Detection..............................................33 4.7 Speech frame before and after Hamming window....................................................34 4.8 Flowchart of computation of LPCC...........................................................................36 4.9 Flowchart of computation of MFCC..........................................................................37 6.1 Results of LPC for codebook size of 32.....................................................................40 6.2 Results of LPC for codebook size of 64.....................................................................41 
6.3 Results of LPC for codebook size of 128...................................................................41 6.4 Comparison of LPC using different codebook sizes..................................................42 6.5 Results of LPCC for codebook size of 32..................................................................43 6.6 Results of LPCC for codebook size of 64..................................................................44 6.7 Results of LPCC for codebook size of 128................................................................44 6.8 Comparison of LPCC using different codebook sizes..............................................45 6.9 Results of MFCC for codebook size of 32.................................................................46 6.10 Results of MFCC for codebook size of 64................................................................47 ENG499 CAPSTONE PROJECT REPORT v 6.11 Results of MFCC for codebook size of 128...............................................................47 6.12 Comparison of MFCC using different codebook sizes..............................................48 6.13 Block diagram of proposed system (LPCC+MFCC).................................................50 6.14 Overview of recognition rates using different codebook sizes..................................51 ENG499 CAPSTONE PROJECT REPORT vi List of Tables 2.1 Comparison of features extraction in terms of filtering techniques..................................16 2.2 Comparison of criteria of feature extraction techniques...................................................16 2.3 Comparison of different feature extraction and modelling techniques.............................23 3.1 Details of project plan.......................................................................................................29 5.1 Voiceprint test parameters……………………………………………………………....39 ENG499 CAPSTONE PROJECT REPORT vii 1. Introduction In everyday life, there is a need for controlled access to certain information /places for security purposes. Typical such secure identification system requires a person to use a cardkey (something that the user has) or to enter a pin (something that the user knows) in order to gain access to the system. However, the two methods mentioned above have some shortcomings as the access control used can be stolen, lost, misused or forgotten. The desire for a more secure identification system (whereby the physical human self is the key to access the system) led to the development of biometric recognition systems. Biometric recognition systems make use of features that is unique to each individual, which is not duplicable or transferable. There are two characteristics of biometric features. Behavioural characteristics such as voice and signature are the result of body part movements. In the case of voice it merely reflects the physical properties of the voice production organs. The articulatory process and the subsequent speech produced are never exactly identical even when the same person utters the same words. Physiological characteristics refer to the actual physical properties of a person such as fingerprint, iris and hand geometry measurement. Some of the possible applications of biometric recognition systems include user-interface customisation and access control such as airport check in, building access control, telephone banking or remote credit card purchases. Speech technology offers many possibilities for personal identification that is natural and non-intrusive. 
Besides that, speech technology offers the capability to verify the identity of a person remotely over long distance by using a normal telephone. A conversation between people contains a lot of information besides just the communication of ideas. Speech also conveys information such as gender, emotion, attitude, health situation and identity of a ENG499 CAPSTONE PROJECT REPORT 1 speaker. The topic of this thesis deals with speaker recognition that refers to the task of recognising people by their voices. 1.1 Development of speaker recognition systems The first type of speaker recognition system in the 1960’s uses spectrogram of voices, also known as voiceprint analysis. It is the acoustic spectrum of the voice that is similar to the fingerprint. However, this type of analysis could not fulfil the goal of automatic recognition as human interpretation was needed. In the 1980’s various methods were proposed to extract features from voice for speaker recognition that represented features in time, frequency or in both domains. Acoustic features of speech differ amongst individuals. These acoustic features include both learned behavioural features (e.g. pitch, accent) and anatomy (e.g. shape of the vocal tract and mouth) [10]. The most commonly extracted features are the Linear Predictive Coding (LPC), Linear Predictive Cepstral Coefficient (LPCC) and MelFrequency Cepstral Coefficients (MFCC) which belong to the short time analysis that provides information on the vocal tract [10]. Different modelling techniques were also developed to model voiceprint extracted from speech. Various concepts were introduced such as pattern matching (Dynamic Time Warping) which does direct template matching between training and testing subject. However, direct template matching is time consuming when the number of feature vectors increase. Clustering is a method to reduce the number of feature vectors by using a codebook to represent centres of the feature vectors (Vector Quantization). The LBG (Linde, Buzo and Gray) algorithm [25] and the k-means algorithm are some of the most well known algorithms for Vector Quantization (VQ). Other methods proposed for speaker modelling includes neural networks and also stochastic models that uses probability distribution such as Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM). ENG499 CAPSTONE PROJECT REPORT 2 Although the development in the field of speech technology is moving rapidly, there are a few inherent problems that have to be solved. The reliability of the speaker recognition drops drastically when a huge user database is used or when it is used under a noisy environment. However, the field of technology is moving fast and it may be possible to refine and improve the robustness of the existing techniques to solve some of the issues. 1.2 Project Objectives The principle objectives of this thesis are: 1. To study the concepts of speaker recognition and understand its uses in identification and verification systems. 2. To conduct research on different types of voiceprint in the field of speaker recognition and understand the details of the feature extraction methods. 3. To evaluate the recognition capability of different voice features and parameters to find out the method that is suitable for Automatic Speaker Recognition (ASR) systems in terms of reliability and computational efficiency. 
1.3 Project Scope Although a lot of work has been done in the field of speaker recognition, there are many practical issues to be resolved before it can be implemented in the real world. The scope of this thesis is to make a general overview of the available techniques and to analyse the reliability of the various voiceprint features for use in ASR. In this project, an open set, text independent, speaker identification system prototype will be developed to conduct the above mentioned. ENG499 CAPSTONE PROJECT REPORT 3 1.4 Summary of Report Chapter 1: Introduction The background of speaker recognition systems is discussed. The overall project objectives and scope for this thesis are defined in this chapter. Chapter 2: Literature review An overview of the techniques used in speaker recognition systems will be discussed. Chapter 3: Project plan The Gantt chart showing the details of the project tasks will be discussed to show how the project is planned and executed. Chapter 4: Development of the speaker recognition system The details of the techniques used in this project to build the prototype speaker recognition system will be discussed in this chapter. Chapter 5: Experiment setup The details of the experiment setup to evaluate the different voice feature extraction methods will be shown. Chapter 6: Results The results of the different voice feature extraction methods will be shown and evaluated in this chapter. Chapter 7: Conclusion This chapter details the summary of the work accomplished and suggestions for further research. Chapter 8: Critical review and reflections The difficulties, successes and personal lessons learnt in this project are summarized in this chapter. ENG499 CAPSTONE PROJECT REPORT 4 2. Literature review The fascination with employing speech for the many purposes in daily life has driven engineers and scientist to conduct vast amount of research and development in this field. The idea of an “Automatic speaker recognition” (ASR) aims to build a machine that can identify a person by recognising voice characteristics or features that are unique to each person. The performance of modern recognition systems has improved significantly due to the various improvements of the algorithm and techniques involved in this field. As of this moment, ASR is still a subject of great interest to researchers and engineers worldwide and the efficiency level of ASR is still improving. This chapter aims to highlight some of the important techniques, algorithm and research that are relevant to this report. Various types of typical pre-processing techniques, feature extraction and speaker modelling techniques will be covered in this report. An overview of the advantages and typical applications of the techniques and algorithm in the speaker recognition system will be provided. Lastly, an overview of the comparison of the speaker recognition systems using algorithms and techniques that are explained in this report will be presented at the end of the chapter. ENG499 CAPSTONE PROJECT REPORT 5 2.1 Concepts of speaker recognition The typical classification of automatic speaker recognition is divided into two tasks: Speaker Identification (SI) and Speaker Verification (SV). Figure 2.1 shows the taxonomy of speech technologies. Speaker recognition is one of the three sub-classes of speech technology which is further subdivided into SI and SV tasks. 
Speaker Recognition Speaker Identification Text-dependent Open-set Closed-set Speaker Verification Text-independent Open-set Text-dependent Text-independent Closed-set Figure 2.1 Speaker recognition processing tree 2.2 Identification and Verification tasks Speaker recognition generally involves two main applications: speaker identification and speaker verification. Speaker identification (SI), or 1:N matching, is the process of finding the identity of an unknown speaker by matching the voice against the voice of registered speakers in the database. The system will then return the best matching speaker as the recognition decision. Speaker verification (SV), or 1:1 matching, is the process of verifying the claimed identity of the unknown speaker by comparing the voice of the unknown speaker against the voice of the claimed speaker in the database. The similarities between the speaker and the speaker template in the database will determine the recognition decision. 2.2.1 Text dependent and Text independent Task Speaker Identification tasks can be further divided into text-dependent and text-independent tasks. In the case of text-dependent systems, the system requires the user to utter words that ENG499 CAPSTONE PROJECT REPORT 6 has been enrolled. Text-independent systems do not require user to speak specific words in order to perform recognition tasks. The system will model the voice feature characteristics of the unknown speaker and perform recognition tasks. In general, text-dependent systems are more accurate as both the content and voice feature characteristics are compared. However, a serious flaw exists in both types of system. An intruder can access the system by using a pre-recorded voice of a registered speaker in the database. A method known as text-prompt speaker verification that randomly selected passwords for the user to utter is used to counter this problem. 2.2.2 Open and closed-set Identification Speaker identification can be further classified into open and closed set recognition. In open set recognition, the system is able to suggest that the voice from the unknown speaker does not match any speaker in the registered database. In closed set recognition, the voice will come only from the specified set of known speakers and the system is forced to make a decision based on the best matching speaker in the registered database. 2.3 Phases of Speaker Identification Automatic speaker recognition system identifies the person speaking based on a database of known speakers in the system [1]. Figure 2.2 shows the overview of an automatic speaker recognition system. In the training or enrolment phase, a new speaker with known identity is enrolled in the database of the system. In the identification phase, voice features from the unknown speaker are extracted and modelled. The speaker model is then used for comparison with speaker models from the enrolment phase to determine the identity of the speaker. Both enrolment and identification phase use the same modelling algorithms. ENG499 CAPSTONE PROJECT REPORT 7 Speech input Training mode Speaker modelling Feature Extraction Pattern matching Recognition mode Database of known speakers Decision Figure 2.2 Block diagram of automatic speaker identification system 2.4 Pre-processing techniques Pre-processing is a critical process performed on speech input in order to develop a robust and efficient speaker recognition system [2]. It is mainly performed in a few stages as shown in figure below. 
A/D End-point Detection Pre-Emphasis Feature Extraction Analogue Speech Signal Figure 2.3 Block diagram of the pre-processing stages

2.4.1 Analogue-To-Digital (A/D) The first stage is the Analogue-to-Digital (A/D) conversion, where the analogue speech signal is converted into a digital signal. In local speaker identification systems, voices are normally recorded using microphones with sampling frequencies that range from 8 kHz to 20 kHz [3]. However, the process of A/D conversion using microphones introduces an unwanted DC offset or constant component that may cause errors in the speaker modelling. It can be removed using two different methods, as mentioned by Marwan [4]. The first method involves performing a fast Fourier transform on the digitised speech, removing the first frequency component of the transform and finally performing an inverse Fourier transform. The second method involves subtracting the mean of the signal from the original signal to remove the DC offset.

2.4.2 End-Point Detection The second stage is the removal of silent segments from the captured speech signal, otherwise known as end-point detection. There are two main reasons for doing this: firstly, most of the speaker specific information or features reside in the voiced segments of the speech signal; secondly, removal of the silent segments reduces unnecessary computation, which improves the efficiency of the ASR. The two most widely used end-point detection methods are the short-time energy method (STE) and the zero-crossing rate method (ZCR). STE uses the fact that silent segments of a speech signal have very low short-time energy. The average energy of the signal is computed, and segments of the speech signal with energy lower than the threshold set are removed. The reliability of the end-point detection depends very much on the threshold chosen. Changes to the threshold value might be required under different ambient/noise conditions [2]. ZCR refers to the rate at which the amplitude of the sound wave changes sign. It uses the theory that silent segments of the signal have a higher ZCR than the voiced segments. The typical silence segment has a ZCR of 50, and a typical voiced segment has a ZCR of about 12 [5]. A study by Mark Greenwood in 1999 combining STE and ZCR for end-point detection shows an average accuracy of 65% [6]. End-point detection is a widely researched field and various techniques have been proposed that can achieve better performance than the conventional STE and ZCR [5; 7]. However, STE and ZCR remain the most widely used methods for speaker recognition systems due to their simplicity and ease of implementation.

2.4.3 Pre-Emphasis The third stage is to perform pre-emphasis by passing the signal through a high-pass filter. The purpose of pre-emphasis is to offset the attenuation due to physiological characteristics of the speech production system and to enhance the higher frequencies to improve the efficiency of the analysis, as most of the speaker specific information lies within the higher frequencies. A study by Li Liu [8] shows that pre-emphasis does improve the performance of the ASR. Another advantage is that pre-emphasis does not require complex computation, so the computation time of the ASR will not be increased much [9].
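To make these pre-processing stages concrete, the fragment below strings together DC offset removal by mean subtraction, a simple short-time-energy silence gate and a first-order pre-emphasis filter in Matlab. It is only an illustrative sketch of the ideas in Sections 2.4.1 to 2.4.3: the input file name, frame length, threshold factor and filter coefficient are assumed values, not those used elsewhere in this project.

```matlab
% Illustrative pre-processing sketch: DC offset removal, short-time
% energy (STE) end-point detection and pre-emphasis.  File name, frame
% size, threshold factor and alpha are assumptions for this example.
[x, fs] = audioread('speaker01.wav');   % hypothetical recording
x = mean(x, 2);                         % average channels to mono
x = x - mean(x);                        % remove DC offset (mean subtraction)

frameLen = round(0.02*fs);              % assumed 20 ms analysis frames
nFrames  = floor(length(x)/frameLen);
frames   = reshape(x(1:nFrames*frameLen), frameLen, nFrames);

energy    = sum(frames.^2, 1)/frameLen;       % short-time energy per frame
threshold = 0.1*mean(energy);                 % assumed energy threshold
speech    = frames(:, energy >= threshold);   % discard low-energy frames
speech    = speech(:);                        % re-assemble voiced speech

alpha  = 0.95;                                % typical pre-emphasis value
speech = filter([1 -alpha], 1, speech);       % H(z) = 1 - alpha*z^-1
```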
2.4.4 Speech analysis technique or framing Speech data contains information that represents speaker identity. The physical structure of the vocal tract, the excitation source and behavioural traits are unique to each person. Selection of a proper frame size and overlap for analysis is crucial in order to extract relevant features that represent speaker identity [10]. Segmental analysis uses frame sizes and overlaps ranging from 10-30 ms and captures information from the vocal tract. The quasi-stationary nature of the frames makes it suitable for practical analyses and processing of vocal tract information. Sub-segmental analysis uses small frame sizes and overlaps ranging from 3-5 ms and is more suitable for capturing characteristics of the excitation source, which change relatively fast compared to vocal tract information. Supra-segmental analysis uses large frame sizes and overlaps ranging from 100-300 ms to capture behavioural traits such as word duration, speaking rate, accent and intonation, which vary much more slowly than vocal tract information.

2.5 Voice features extraction Voice feature extraction, otherwise known as front-end processing, is performed in both recognition and training mode. Feature extraction converts the digital speech signal into sets of numerical descriptors called feature vectors that contain key characteristics of the speaker.

2.5.1 Linear Predictive Coding (LPC) Linear predictive coding (LPC) is one of the earliest standardised coders. LPC has been proven to be efficient for the representation of speech signals in mathematical form. LPC is a useful tool for feature extraction as the vocal tract can be accurately modelled and analysed. Studies have shown that the current speech sample is highly correlated with the previous sample and the immediately preceding samples [11]. LPC coefficients are generated from a linear combination of past speech samples, using the autocorrelation or the autocovariance method and minimising the sum of squared differences between the predicted and actual speech samples:

$\hat{s}(n) = \sum_{k=1}^{M} a_k\, s(n-k)$

where $\hat{s}(n)$ is the sample predicted from the summation of past samples, $a_k$ are the linear prediction coefficients, M is the number of coefficients and n is the sample index. The error between the actual sample and the prediction can then be expressed by

$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{M} a_k\, s(n-k)$

The speech sample can then be accurately reconstructed by using the LP coefficients $a_k$ and the residual error $e(n)$, and can be represented in the z-domain by $E(z) = A(z)\,S(z)$ with $A(z) = 1 - \sum_{k=1}^{M} a_k z^{-k}$. The figure below shows the analysis filter. Figure 2.4 Speech Analysis Filter The transfer function H(z) can be expressed as an all-pole function, $H(z) = G/A(z) = G/\bigl(1 - \sum_{k=1}^{M} a_k z^{-k}\bigr)$, where G represents the gain of the system. The figure below shows the speech synthesis filter. Figure 2.5 Speech Synthesis Filter Schroeder [12] mentioned that the LPC model can adequately model most speech sounds by passing an excitation pulse through a time-varying all-pole filter using the LP coefficients. S. Kwong [13] considers LPC a method that provides a good estimate of the vocal tract spectral envelope. Gupta [14] mentioned that LPC is important in speech analysis because of the accuracy and speed with which it can be derived. The feature vectors are calculated by LPC over each frame. The number of coefficients used to represent a frame typically ranges from 10 to 20, depending on the speech sample, the application and the number of poles in the model. However, LPC also has disadvantages. Firstly, LPC approximates speech linearly at all frequencies, which is inconsistent with the hearing perception of humans. Secondly, LPC is very susceptible to background noise, which may cause errors in the speaker modelling.
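As a rough illustration of how the LP coefficients above are obtained in practice, the fragment below estimates a fixed-order LPC vector for each windowed frame using Matlab's autocorrelation-based lpc function (Signal Processing Toolbox). The frames matrix and the model order are assumed here purely for illustration.

```matlab
% Illustrative LPC feature extraction.  Assumes 'frames' holds windowed
% speech with one frame per column; the order M = 12 is an example only.
M = 12;                                   % assumed prediction order
nFrames = size(frames, 2);
lpcFeatures = zeros(nFrames, M);
for k = 1:nFrames
    aPoly = lpc(frames(:,k), M);          % returns [1 -a1 -a2 ... -aM]
    lpcFeatures(k,:) = -aPoly(2:end);     % predictor coefficients a1..aM
end
```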
2.5.2 Linear Predictive Cepstral Coefficients Linear predictive cepstral coefficients (LPCC) combine the benefits of LPC and cepstral analysis and also improve the accuracy of the features obtained for speaker recognition. The LPCC are equivalent to a smoothed envelope of the log spectrum of the speech, which allows for the extraction of speaker specific features. The block diagram of the LPCC is shown in the figure below. A/D Pre-emphasis Speech Input Framing/Windowing Linear Predictive Coefficient Linear Predictive Cepstral Coefficient Figure 2.6 Block diagram of Linear Predictive Cepstral Coefficient The LPC are transformed into cepstral coefficients using the recursive formula

$c_i = a_i + \sum_{k=1}^{i-1} \frac{k}{i}\, c_k\, a_{i-k}$

where $c_i$ and $a_i$ are the ith-order cepstrum coefficient and linear predictor coefficient, respectively. Atal [15] did a study on various parameters for LPC and found the cepstrum to be the most effective parameter for speaker recognition. Eddie Wong [16] mentioned that LPCC is more robust and reliable than LPC. However, LPCC also performs poorly in noisy environments.

2.5.3 Mel-Frequency Cepstral Coefficients (MFCC) The Mel-frequency cepstral coefficient is one of the most prevalent and popular methods used in the field of voice feature extraction. The difference between the MFC and cepstral analysis is that the MFC maps frequency components using a Mel scale modelled on the human ear's perception of sound instead of a linear scale [17]. The Mel-frequency cepstrum represents the short-term power spectrum of a sound using a linear cosine transform of the log power spectrum on a Mel scale. The formula for the Mel scale is

$f_{mel} = 2595\, \log_{10}\!\left(1 + \frac{f}{700}\right)$

Figure 2.7 Mel scale plot (Mel frequency in mels versus linear frequency in Hz) Vergin [18] mentioned that MFCCs, as frequency domain parameters, are much more consistent and accurate than time domain features. Vergin [18] listed the steps leading to the extraction of MFCCs: fast Fourier transform, filtering and cosine transform of the log energy vector. According to Vergin [19], MFCCs can be obtained by mapping an acoustic frequency to a perceptual frequency scale called the Mel scale. MFCCs are computed by taking the windowed frame of the speech signal, putting it through a Fast Fourier Transform (FFT) to obtain certain parameters and finally undergoing Mel-scale warping to retrieve feature vectors that represent useful logarithmically compressed amplitude and simplified frequency information [20]. Seddik [21] mentioned that MFCCs are computed by applying the discrete cosine transform to the log of the Mel-filter bank outputs. The results are features that describe the spectral shape of the signal. Rashidul [17] describes the main steps for the extraction of MFCC, shown in the figure below. The main steps are as follows: pre-emphasis, framing, windowing, fast Fourier transform (FFT), Mel frequency warping, filter bank, logarithm and discrete cosine transform (DCT). A/D Pre-emphasis Framing/Windowing Speech Input Fourier Transform Mel-frequency warping Logarithm Discrete Cosine Transform Mel-Frequency Cepstral Coefficients Figure 2.8 Block diagram of Mel-Frequency Cepstral Coefficient The main advantage of MFCC is its robustness towards noise and spectral estimation errors under various conditions [22]. A. Reynolds did a study on the comparison of different features and found that the MFCC provides better performance than other features [23].
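The fragment below sketches this MFCC pipeline (FFT, Mel-spaced triangular filters, logarithm, DCT) for a single windowed frame. The filterbank construction and the parameter choices (FFT length, number of filters, number of coefficients) are simplified assumptions for illustration and not the exact implementation used later in this project.

```matlab
% Illustrative MFCC computation for one windowed frame 'frame'.
% fs, FFT length, filter count and coefficient count are examples only.
fs = 16000;  nFFT = 512;  nFilt = 24;  nCep = 12;
frame = frame(:);                       % ensure a column vector

spec = abs(fft(frame, nFFT)).^2;        % power spectrum of the frame
spec = spec(1:nFFT/2 + 1);              % keep non-negative frequencies

mel  = @(f) 2595*log10(1 + f/700);      % Hz -> Mel
imel = @(m) 700*(10.^(m/2595) - 1);     % Mel -> Hz
edges = imel(linspace(0, mel(fs/2), nFilt + 2));       % filter edges (Hz)
bins  = round(edges/(fs/2)*(nFFT/2)) + 1;              % FFT bin indices

fbank = zeros(nFilt, nFFT/2 + 1);       % triangular Mel filterbank
for m = 1:nFilt
    fbank(m, bins(m):bins(m+1))   = linspace(0, 1, bins(m+1)-bins(m)+1);
    fbank(m, bins(m+1):bins(m+2)) = linspace(1, 0, bins(m+2)-bins(m+1)+1);
end

mfccFrame = dct(log(fbank*spec));       % log filterbank energies -> DCT
mfccFrame = mfccFrame(1:nCep);          % keep the first coefficients
```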
2.6 Summary of feature extraction techniques A summary of the feature extraction techniques is compiled in Table 2.1, which compares the techniques mentioned in this report in terms of filtering, relevant variables, inputs and corresponding outputs.

Process: Feature Extraction
Technique: Linear Predictive Coding (LPC); Type of Filter: All-Pole Filter; Relevant variables/Data structure: Statistical features (Linear Predictive Coefficients); Output: Linear Predictive Coefficients (LPC)
Technique: Linear Predictive Cepstral Coefficients; Type of Filter: All-Pole Filter; Relevant variables/Data structure: Statistical features (Linear Predictive Cepstral Coefficients); Output: Linear Predictive Cepstral Coefficients (LPCC)
Technique: Mel-Frequency Cepstral Coefficient (MFCC); Type of Filter: Mel-Filter Bank; Relevant variables/Data structure: Statistical features (Mel-Frequency Cepstral Coefficients); Output: Mel-Frequency Cepstral Coefficients (MFCC)
Table 2.1 Comparison of features extraction in terms of filtering techniques

Criteria (LPC / LPCC / MFCC):
Main Task: Features extracted by analysing past speech samples / Features extracted by combining LPC with spectral analysis / Features extracted in the frequency domain using a Mel scale that represents human hearing
Speaker Dependence: High, speaker dependent / High, speaker dependent / Moderate, speaker dependent
Robustness: Poor / Poor / Good
Motivation and Representation: Speech production motivated representation / Speech production motivated representation / Perceptually motivated representation
Filter Bank: All-Pole Filters / All-Pole Filters / Triangular Mel Filters
Typical Applications: Speech compression / Speaker and speech recognition / Speaker and speech recognition
Table 2.2 Comparison of criteria of feature extraction techniques

2.7 Speaker Modeling The objective of modeling techniques is to generate patterns or speaker models for feature matching. The speaker models are models that contain enhanced speaker specific information at a compressed rate [10]. In the training or enrolment mode, speaker models are built using the specific voice features extracted from the current speaker. In the recognition mode, the stored speaker model is compared with the current speaker's model for identification or verification purposes [24]. Target/Speaker model Front-end process + Σ Impostor/background model Λ>θ Accept Score Normalisation - Λ<θ Reject Figure 2.9 Block diagram of speaker decision Three main types of modeling techniques are covered, namely template matching, stochastic modeling and neural networks.

2.7.1 Template Matching In template matching, the speaker model may simply contain the feature template of the frames of speech. A matching score is computed by calculating the distance between the input feature template and the model templates in the system database to determine the identity of the speaker.

2.7.1.2 Dynamic Time Warping (DTW) Dynamic time warping is one of the most popular and widely used template-based methods for text-dependent speaker recognition systems. DTW is a technique that uses dynamic programming to process text-dependent input feature vectors and remove the effect of speech rate variability of the speakers. The matching score is computed by comparing the feature vectors frame by frame with the speaker model in the database and is used to identify the speaker [1].
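To illustrate the dynamic-programming idea behind DTW, the sketch below aligns two feature-vector sequences (one vector per row) and returns the accumulated alignment cost using local Euclidean distances. It is a textbook formulation for illustration only, not the specific DTW variant used in the systems cited above.

```matlab
function cost = dtw_distance(A, B)
% DTW_DISTANCE  Accumulated dynamic time warping cost between two
% feature sequences A and B (one feature vector per row).  The local
% cost is the Euclidean distance between vectors.  Minimal sketch.
nA = size(A, 1);  nB = size(B, 1);
D = inf(nA + 1, nB + 1);                % accumulated-cost matrix
D(1, 1) = 0;
for i = 1:nA
    for j = 1:nB
        d = norm(A(i,:) - B(j,:));      % local frame-to-frame distance
        D(i+1, j+1) = d + min([D(i, j), D(i, j+1), D(i+1, j)]);
    end
end
cost = D(end, end);                     % alignment cost used as the score
end
```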
2.7.2 Vector Quantization source modelling Vector Quantization is an efficient way of compressing large sets of training vectors by using codebooks. Codebooks contain the numerical representation of features that are speaker specific. The speaker specific codebook is generated in the training phase by clustering the feature vectors of each speaker (as shown in the figure below). In the recognition stage, input utterances are vector quantised and the VQ distortion calculated over the entire utterance is used to determine the identity of the speaker. Figure 2.10 2-dimensional VQ with 32 regions. Retrieved on September 2009 from www.data-compression.com/vq.html There are many types of codebook generation algorithms, but the most well known and widely applied are the K-means algorithm [25; 26] and the Linde, Buzo and Gray (LBG) algorithm [27]. The steps for the K-means algorithm are: 1. Cluster the vectors based on their attributes into k partitions with centroids. 2. Assign each feature vector to the centroid nearest to it. 3. Recalculate the position of the k centroids as the mean of the feature vectors assigned to each centroid. 4. Repeat steps 2 and 3 until the positions of the centroids no longer change. Figure 2.11 K-means clusters. Retrieved on September 2009 from http://cmp.felk.cvut.cz/cmp/software/stprtool/examples.html The advantage of K-means clustering lies in its simplicity and ease of computation. However, it does not guarantee that the partitioning of the speech space is optimal. The steps for the Linde, Buzo and Gray (LBG) algorithm are: 1. Determine the number of codewords, N, or the size of the codebook. 2. Select a random codeword as the initial centroid. 3. Split each centroid into two codewords. 4. Compute the Euclidean distances to cluster the vectors around the codewords. 5. Compute the new set of codewords. 6. Compute the distortion using the Euclidean distance. 7. Repeat steps 4 to 6 until the codewords do not change or the change in the codewords is small. 8. Repeat step 3 until the desired number of codewords, N, is attained. Figure 2.12 Steps for LBG algorithm. Retrieved on September 2009 from https://engineering.purdue.edu/people/mireille.boutin.1/ECE301kiwi/LBGAlgorithm The advantage of LBG lies in the generation of accurate codebooks with minimum distortion when a good quality initial codebook is used. However, due to its complexity, the computation cost is high [28].
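The sketch below implements the splitting idea of the LBG algorithm in a simplified form: start from the global centroid, double the codebook by perturbing each codeword, then refine with nearest-neighbour reassignment and mean updates until the desired size is reached. The perturbation factor, iteration limit and variable names are illustrative assumptions rather than the exact implementation used in this project.

```matlab
function codebook = lbg_codebook(features, N)
% LBG_CODEBOOK  Simplified LBG codebook training.
%   features holds one feature vector per row; N is the desired
%   codebook size (a power of two).  Epsilon and the fixed iteration
%   count are illustrative choices for this sketch.
epsilon  = 0.01;
codebook = mean(features, 1);                   % start from one centroid
while size(codebook, 1) < N
    % split every codeword into two slightly perturbed codewords
    codebook = [codebook*(1 + epsilon); codebook*(1 - epsilon)];
    for iter = 1:20                             % refinement iterations
        % assign each vector to its nearest codeword (squared Euclidean)
        nVec = size(features, 1);  nCode = size(codebook, 1);
        d = zeros(nVec, nCode);
        for k = 1:nCode
            diff   = features - repmat(codebook(k,:), nVec, 1);
            d(:,k) = sum(diff.^2, 2);
        end
        [~, idx] = min(d, [], 2);
        % move each codeword to the mean of its assigned cluster
        for k = 1:nCode
            if any(idx == k)
                codebook(k,:) = mean(features(idx == k, :), 1);
            end
        end
    end
end
end
```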
2.7.3 Stochastic Models In stochastic model methods, pattern matching is formulated by measuring the likelihood that the feature vectors from the input match the feature vectors of the speaker model in the database [24]. One of the most popular methods is the Hidden Markov Model (HMM).

2.7.3.1 Hidden Markov Model Magdi [29] mentioned that HMMs are extremely useful for modeling sequentially changing behavior, as in speech applications. Figure 2.13 Probabilistic parameters of a hidden Markov model (example): x, states; y, possible observations; a, state transition probabilities; b, output probabilities. Retrieved on September 2009 from http://www.answers.com/topic/hidden-markov-model Rabiner [30] defines an HMM as "a doubly embedded stochastic process that is hidden but can be observed by using another set of stochastic process that produces the sequence of observations." This means that the states of the HMM are not directly visible, but the output, which depends on the state, is visible. Each state has a probability distribution over the possible outputs; as such, the sequence of outputs generated by the HMM gives information about the sequence of states. The advantage of the HMM technique lies in its ability to capture the voiced/unvoiced information in the states, which reduces its reliance on good voiced/unvoiced segmentation techniques. Information such as intonation and accent can also be captured in the states [31]. However, Rabiner [30] identifies three basic problems to be solved with HMM for it to be useful for practical applications. The three main concerns are the evaluation problem, the decoding problem and the learning problem. Evaluation refers to how well a given model can match a given observation sequence. Decoding refers to the attempt to uncover the hidden part of the model to find the correct state sequences. Finally, learning refers to the attempt to optimise the model parameters to the observed training data in order to create a good model.

2.8 Neural Networks The ability of neural networks to recognise patterns of different classes makes them suitable for speaker recognition. A typical neural network has three main components: the input layer, the hidden layer, which can be one or more layers, and the output layer [10]. Each of the layers contains interconnected processing units that represent neurons. During the training phase, the weights of the neurons are adjusted using a training algorithm that attempts to minimise the sum of squared differences between the desired and actual values of the output neurons. The weights are adjusted over several training iterations until the desired sum of squared differences between the desired and actual values is attained [32]. To put it simply, neural networks are used to model the pattern between inputs and outputs.
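As a toy illustration of this training procedure, the fragment below adjusts the weights of a one-hidden-layer network by gradient descent on the sum of squared output errors. The data, network size, learning rate and epoch count are all made-up values for demonstration; it is not the configuration of any system referenced in this report.

```matlab
% Toy one-hidden-layer network trained by gradient descent on the sum
% of squared output errors.  All sizes and data here are illustrative.
X = randn(50, 12);                     % 50 example feature vectors (12-dim)
T = double(rand(50, 1) > 0.5);         % example target values (0 or 1)

nHidden = 8;  lr = 0.05;  epochs = 500;
W1 = 0.1*randn(12, nHidden);  b1 = zeros(1, nHidden);
W2 = 0.1*randn(nHidden, 1);   b2 = 0;
sigm = @(z) 1./(1 + exp(-z));          % neuron activation function

for e = 1:epochs
    H = sigm(X*W1 + repmat(b1, size(X,1), 1));   % hidden-layer outputs
    Y = sigm(H*W2 + b2);                         % network outputs
    err = Y - T;                                 % output error
    % backpropagate the squared-error gradient and update the weights
    dY = err .* Y .* (1 - Y);
    dH = (dY*W2') .* H .* (1 - H);
    W2 = W2 - lr*(H'*dY);    b2 = b2 - lr*sum(dY);
    W1 = W1 - lr*(X'*dH);    b1 = b1 - lr*sum(dH, 1);
end
```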
System Reference Source Features Applied Dimension Speaker Identification using Mel Frequency Cepstral Coefficients A continuous HMM text-independent speaker recognition system based on vowel spotting Automatic Speaker Identification Using Vector Quantization Text-dependent Speaker Identification Using Neural Network On Distinctive Thai Tone Marks A Vector Quantization Approach to Speaker Recognition [17] MFCC [33] [27] LPC 9 VQ-LBG 67% to 98% Speaker Identification using Cepstral Based Features and Discrete Hidden Markov Model Speaker-Independent Phone Recognition Using Hidden Markov Models [36] LPCC 10-20 HMM 83.93% to 87.5% MFCC 10-20 LPC 12 Recognition rate 20 Modelling techniques applied VQ MFCC 19 HMM 59.17% to 98.85% [34] MFCC 12 VQ 65% to 98% [35] LPCC 10 Neural 68.89% to 95.56% 57% to 100% Network [37] 75% to 83.93% HMM 64.07% to 73.80% Table 2.3 Comparison of different feature extraction and modelling techniques Selection of LPC, LPCC & MFCC for feature extraction and VQ-LBG for modelling was based on their proven high level of performance in ASR, comprehensive list of materials that is readily available for referencing, ease of implementation, and also their popularity in the field of speaker recognition. In addition, the combination of the above mentioned feature extraction together with VQLBG also shows relatively good results in various papers as shown in table. This report aims ENG499 CAPSTONE PROJECT REPORT 23 to examine and validate the various feature extraction methods mentioned above for their suitability in ASR. Achieving a recognition rate of 85% and above for the recognition system indicates the usefulness of the above feature extraction and modelling method mentioned. This chapter has presented an overview of the most widely known concepts in the field of ASR. The steps to design an ASR begin with pre-processing, feature extraction, modelling and finally, comparison using distance methods. The LPC, LPCC and MFCC are the most popular feature extraction methods while HMM, Neural Networks and VQ are the most popular modelling techniques widely employed in most modern day ASR. ENG499 CAPSTONE PROJECT REPORT 24 3. Project Plan The aim of a project plan is to list down the tasks and the timeframe for it to be completed in a systematic manner to ensure the successful completion of the project. Main tasks are listed below: Stage 1 (Background information research). This stage focuses on the background information such as the origin, terminology, concepts, uses and limitations of ASR. Sources include IEEE journals, Internet and books that explain the basic terminology and concepts used in ASR. Verification using IEEE journals, books and various other readings was done to ensure the soundness of information collected. Stage 2 (Preparation of initial report) Preparation of the initial report was conducted simultaneously together with the background research. This stage took three weeks to complete. Stage 3 (Research on algorithms and voice print features used in speaker recognition systems) Important technical details of the project are subdivided into pre-processing, feature extraction, modelling techniques and comparison methods for clearer focus during research. Selection of techniques and algorithm for the implementation of the speaker recognition system will be done in this stage. 
Stage 4 (Learning of Matlab programming and development of prototype software for speaker recognition system) At this stage, the aim is to be proficient in Matlab programming so as to develop an ASR using the Graphical User Interface (GUI) platform in Matlab. It is divided into three subsections: ENG499 CAPSTONE PROJECT REPORT 25 1. Familiarisation with Matlab functions – Learning the basic functions, operators and programming method used in Matlab. 2. Learning Matlab GUI functions – Learn and understand the Matlab GUI platform, functions and method of building a GUI Interface. 3. Build GUI Interface – Construct the basic GUI interface for the ASR. Stage 5 (Software simulation and evaluation of different algorithms and voice print features using the prototype software developed) At this stage, the various stages involved in building an ASR will be implemented and Matlab codes written will be tested. Programming modules are divided into six subsections: 1. Pre-processing 2. LPC 3. LPCC 4. MFCC 5. Modelling techniques 6. Comparison techniques Phase 6 (Project review): At this stage, the testing and evaluation of voice features will be done and the results will be compiled for analysis. The three main subsections are 1. Compilation of results – Testing results will be compiled and tabulated. 2. Analysis of results – Results will be evaluated to determine the effective of the features for speaker recognition. 3. Assessment of difficulties encountered - Difficulties encountered during the project are noted down and possible solutions will be suggested in the final report. ENG499 CAPSTONE PROJECT REPORT 26 Phase 7 (Preparation of final report – Thesis): This stage is the preparation of the final report. It is subdivided into different sections in order to provide a clearer objective so as to complete the report writing in time. A review will be conducted to finalise and make necessary amendments. Phase 8 (Preparation of oral presentation): The final task will be the preparation of the oral project presentation. The details for the poster will be finalised and sent for printing, and preparation for the presentation will be done. ENG499 CAPSTONE PROJECT REPORT 27 3.1 Gantt Chart ENG499 CAPSTONE PROJECT REPORT 28 Voice print analysis for speaker recognition system Start Date Duration End Date Task Description 64 26-Jan-09 30-Mar-09 1. Background information research 26-Jan-09 12 6-Feb-09 1.1 Study relevant articles and papers 7-Feb-09 8 14-Feb-09 1.2 Overview of Speaker Recognition Systems 15-Feb-09 44 30-Mar-09 1.3 Techniques for feature extractions 15-Feb-09 44 30-Mar-09 1.4 Modelling techniques for ASR 26 2-Feb-09 27-Feb-09 2. Preparation of initial report (TMA01) 3. Research on techniques used for ASR 3.1 Review on feature/voiceprint extraction techniques 3.2 Review of modelling techniques for database 3.3 Review of comparison techniques for speaker recognition 4. Learning of Matlab programming and development of prototype software for speaker recognition system 4.1 Familarising of Matlab functions and commands 4.2 Matlab GUI functions 4.3 Building GUI for ASR 5. Software simulation and evaluation of different algorithms and voice print features using the prototype software developed 5.1 Implementing Pre-processing in Matlab 5.2 Implementing LPC in Matlab 5.3 Implementing LPCC in Matlab 5.4 Implementing MFCC in Matlab 5.5 Implementing VQ-LBG in Matlab 5.6 Implementing comparison technique in Matlab 6. 
Project review 6.1 Compilation of results 6.2 Analysis of results 6.3 Assessment of difficulties encountered 7. Preparation of final report 7.1 Writing skeleton of final report 7.2 Writing literature review of report 7.3 Writing introduction of report 7.4 Writing main body of report 7.5 Writing conclusion and further study 7.6 Finalising and amendments to final report 8. Preparation of oral presentation 28-Feb-09 70 8-May-09 28-Feb-09 29 28-Mar-09 29-Mar-09 19-Apr-09 21 20 18-Apr-09 8-May-09 9-May-09 36 13-Jun-09 9-May-09 8 16-May-09 17-May-09 31-May-09 14-Jun-09 14 14 58 30-May-09 13-Jun-09 10-Aug-09 14-Jun-09 28-Jun-09 5-Jul-09 12-Jul-09 19-Jul-09 26-Jul-09 14 7 7 7 7 16 27-Jun-09 4-Jul-09 11-Jul-09 18-Jul-09 25-Jul-09 10-Aug-09 14-Jun-09 14-Jun-09 9-Aug-09 4-Oct-09 14-Jun-09 14-Jun-09 28-Jun-09 2-Aug-09 15-Aug-09 9-Sep-09 1-Oct-09 12-Oct-09 121 57 56 8 121 14 35 13 25 22 12 40 11-Oct-09 9-Aug-09 3-Oct-09 11-Oct-09 11-Oct-09 27-Jun-09 1-Aug-09 14-Aug-09 8-Sep-09 30-Sep-09 11-Oct-09 22-Nov-09 Resources IEEE ASP, Internet, Library resources Personal Computer & Matlab Table 3.1 Details of project plan ENG499 CAPSTONE PROJECT REPORT 29 4. Development of the speaker recognition system The main purpose of the prototype system is to compare the recognition rate in order to determine the suitability of the different types of features to be use in a speaker recognition system. The development of the prototype speaker recognition system will be done using Matlab. This chapter will describe in detail the techniques used for the pre-processing and voice feature extraction stages. The figure below shows the flowchart of the program development for feature extraction. Select and open wav file of speaker Perform end-point detection Perform Pre-Emphasis Framing/Windowing Perform feature extraction LPC MFCC LPCC Create Speaker Model using LBG algorithm Figure4.1Flowchart of prototype speaker recognition system ENG499 CAPSTONE PROJECT REPORT 30 4.1 DC Offset Removal The average of the signal will be computed and subtracted from the original signal. 4.2 End-point detection Endpoint detection refers to the removal of silence portion of the speech data. STE will be method implemented for this process in this project. Basically, the speech signal will be divided into 0.5ms frames and compared with the average energy of the speech signal. Frames with energy below the threshold set will be discarded. Retained frames will be combined to form the final speech data for further speech processing. Speech data wih DC offset removed Calculate average energy of speech signal Split the speech signal into 0.5ms frames Calculate energy of split speech signal Calculate next frame Framed speech signal energy is compared with the average energy Frame energy < Average speech energy Speech frame is discarded Frame energy >= Average speech energy Speech frame is retained Combine speech frames retained to retrieve speech signal Figure 4.2 Flowchart of end-point detection for prototype speaker recognition system ENG499 CAPSTONE PROJECT REPORT 31 Average energy of speech frame is computed using this formula: Figure 4.3 Speech data before and after End-Point Detection 4.3 Pre-Emphasis Pre-emphasis is a technique used to enhance the high frequencies of speech signal. There are two important factors for doing this: 1. To enhance the speaker specific information in the higher frequencies of speech 2. 
4.3 Pre-Emphasis

Pre-emphasis is a technique used to enhance the high frequencies of the speech signal. There are two important reasons for doing this:
1. To enhance the speaker-specific information present in the higher frequencies of speech.
2. To negate the effect of the energy decrease at higher frequencies, so that the whole spectrum of the speech signal can be analysed properly.

In this project, pre-emphasis is implemented as a first-order Finite Impulse Response (FIR) filter of the form

$H(z) = 1 - \alpha z^{-1}$, i.e. $y(n) = s(n) - \alpha\, s(n-1)$

Generally $\alpha$ is selected between 0.9 and 0.95. The figures below show the speech signal before and after pre-emphasis.

Figure 4.4 Frequency response and z-plane plot of the pre-emphasis filter with $\alpha$ = 0.95
Figure 4.5 Speech data before and after pre-emphasis
Figure 4.6 Power spectrum before and after pre-emphasis

4.4 Framing/Windowing

Speech signals are quasi-stationary when examined over small time intervals of 15 ms to 30 ms. Framing divides the speech signal into smaller segments to make it suitable for practical analysis and processing of the vocal tract information; overlapping is required to fully capture the speaker-specific features in the speech data. Windowing is then performed on each frame to smooth the abrupt discontinuities at the end points of the frames. In this project, the speech signal is divided into fixed frames of 20 ms with an overlap of 50%, and a Hamming window is used to suppress the abrupt and undesirable frequency components at the frame boundaries.

Figure 4.7 Speech frame before and after Hamming window

4.5 Feature extraction

The evaluation of the different types of features extracted from voice, to determine their suitability for ASR, is the main focus of this project. The most popular and widely known features used for ASR are the Linear Prediction Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC) and Mel-Frequency Cepstral Coefficients (MFCC). This project therefore uses these three features and evaluates their suitability for implementation in an ASR.

4.5.1 Linear Prediction Coefficients (LPC)

In this project, the LPC coefficients are obtained by passing the speech frames into the lpc function in Matlab. The Matlab function uses the autocorrelation method of autoregressive (AR) modelling to find the filter coefficients: it computes the least-squares solution to $Xa = b$, where $X$ is the data matrix formed from the windowed frame $x$ of length $m$ and $a$ is the vector of predictor coefficients. Solving the least-squares problem via the normal equations

$X^{H}Xa = X^{H}b$

leads to the Yule-Walker equations, whose entries $r(1), r(2), r(3), \dots, r(p+1)$ are autocorrelation estimates for $x$ computed using xcorr. The Yule-Walker equations are solved using the Levinson-Durbin algorithm. In this project, the orders of LPC used are 8, 12, 16 and 20.

4.5.2 Linear Prediction Cepstral Coefficients (LPCC)

LPCC is a technique that combines LP and cepstral analysis by taking the inverse Fourier transform of the log magnitude of the LPC spectrum, giving improved accuracy and robustness of the extracted voice features.

Figure 4.8 Flowchart of the computation of LPCC (for each frame of the windowed speech signal: extract the Linear Prediction Coefficients using the lpc function in Matlab, then extract the Linear Prediction Cepstral Coefficients using the recursive formula; repeat until the end of the speech frames)

For this project, the recursive formula used for the calculation of the LPCC from the LPC coefficients $a_1, \dots, a_p$ is:

$c_1 = a_1$, and $c_n = a_n + \sum_{k=1}^{n-1} \frac{k}{n}\, c_k\, a_{n-k}$ for $1 < n \le p$

In this project, the orders of LPCC used are 8, 12, 16 and 20.
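The sketch below ties Sections 4.3 to 4.5.2 together: pre-emphasis, 20 ms framing with 50% overlap and a Hamming window, LPC extraction with the Matlab lpc function, and the LP-to-cepstrum recursion stated above. It assumes the end-point-detected signal speechEP and the sampling rate fs from the earlier sketch; the order p and all variable names are illustrative choices rather than the project's actual code, and the hamming and lpc functions require the Signal Processing Toolbox.

% Pre-emphasis, framing/windowing, LPC and LPCC sketch
alpha = 0.95;                               % pre-emphasis factor (0.9 to 0.95)
x     = filter([1 -alpha], 1, speechEP);    % y(n) = s(n) - alpha*s(n-1)

frameLen = round(0.020 * fs);               % 20 ms frames
hop      = round(frameLen / 2);             % 50% overlap
win      = hamming(frameLen);
p        = 16;                              % LPC/LPCC order (8, 12, 16 or 20)

nFrames  = floor((length(x) - frameLen) / hop) + 1;
lpccAll  = zeros(p, nFrames);               % one LPCC vector per frame

for i = 1:nFrames
    idx   = (i-1)*hop + 1;
    frame = x(idx : idx+frameLen-1) .* win;     % windowed frame
    aHat  = lpc(frame, p);                      % returns [1, -a_1, ..., -a_p]
    a     = -aHat(2:end).';                     % predictor coefficients a_1..a_p
    c     = zeros(p, 1);                        % LP-to-cepstrum recursion
    c(1)  = a(1);
    for n = 2:p
        c(n) = a(n);
        for k = 1:n-1
            c(n) = c(n) + (k/n) * c(k) * a(n-k);
        end
    end
    lpccAll(:, i) = c;
end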
4.5.3 Mel-Frequency Cepstral Coefficients (MFCC)

MFCC uses a bank of filters to warp the frequency spectrum onto the Mel scale, which approximates how the human ear perceives sound. The filters of the Mel scale are linearly spaced at low frequencies and logarithmically spaced at high frequencies to imitate human hearing perception. For this project, the MFCC filters are adapted from Voicebox: Speech Processing Toolbox for MATLAB by Mike Brookes. MFCC are extracted by passing the frames of the windowed speech signal into the mfcc.m function written for this project.

Figure 4.9 Flowchart of the computation of MFCC (for each frame of the windowed speech signal: Fast Fourier Transform; Mel-frequency warping by applying the filterbank function melbank.m from Voicebox; logarithm; Discrete Cosine Transform; Mel-Frequency Cepstral Coefficients; repeat until the end of the speech frames)
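The following sketch mirrors the flowchart above (FFT, Mel-frequency warping, logarithm, DCT). The project uses the Voicebox filterbank; to keep the sketch self-contained, a simple triangular Mel filterbank is built directly, and the placeholder frame, the number of filters (24) and the number of coefficients kept are illustrative assumptions rather than the project's actual settings.

% MFCC sketch: power spectrum -> Mel filterbank -> log -> DCT
fs       = 16000;                      % sampling rate used in the project
frameLen = round(0.020 * fs);          % 20 ms frame
nfft     = 2^nextpow2(frameLen);
nFilt    = 24;                         % number of Mel filters (assumed)
nCoef    = 16;                         % number of cepstral coefficients kept

% Triangular Mel filterbank: nFilt filters evenly spaced on the Mel scale
mel   = @(f) 2595 * log10(1 + f/700);
imel  = @(m) 700 * (10.^(m/2595) - 1);
edges = imel(linspace(mel(0), mel(fs/2), nFilt + 2));   % band edges in Hz
bins  = floor(nfft * edges / fs) + 1;                   % corresponding FFT bins
fbank = zeros(nFilt, nfft/2 + 1);
for m = 1:nFilt
    for k = bins(m):bins(m+1)          % rising slope of triangle m
        fbank(m, k) = (k - bins(m)) / max(bins(m+1) - bins(m), 1);
    end
    for k = bins(m+1):bins(m+2)        % falling slope of triangle m
        fbank(m, k) = (bins(m+2) - k) / max(bins(m+2) - bins(m+1), 1);
    end
end

% MFCC of one windowed frame (a random placeholder frame keeps the sketch runnable)
frame   = hamming(frameLen) .* randn(frameLen, 1);
spec    = abs(fft(frame, nfft)).^2;    % power spectrum
spec    = spec(1:nfft/2 + 1);
melE    = fbank * spec;                % Mel-warped energies
mfccVec = dct(log(melE + eps));        % logarithm followed by DCT
mfccVec = mfccVec(1:nCoef);            % keep the first nCoef coefficients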
4.6 Speaker modelling

A speaker model is a compact representation of the speaker's voice features. In this project, the speaker models are generated using the Vector Quantization LBG (VQ-LBG) method: the feature coefficients of each speaker are passed into the function to generate codebooks of sizes 32, 64 and 128. The rationale for choosing the VQ-LBG method is its ease of implementation and its comparable performance to other speaker modelling methods.

This project uses the squared Euclidean distance for the speaker similarity measure; the only difference from the ordinary Euclidean distance is that the square root is not taken. The formula used in the function is

$d(x, y) = (x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2$

or, equivalently, in vector form,

$d_i(x_i, x) = (x_i - x)^{T}(x_i - x)$
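A minimal sketch of the VQ-LBG modelling and squared-Euclidean scoring described above is given below, written as a Matlab script with local functions (R2016b or later). The training and test matrices, the 1% split factor, the number of Lloyd iterations and the function names are illustrative assumptions; in the actual system the feature matrices come from the LPCC or MFCC extraction stages.

% VQ-LBG codebook training and average-distortion matching score sketch
trainFeat = randn(16, 2000);          % placeholder training vectors (dim x N)
testFeat  = randn(16, 300);           % placeholder test vectors (dim x M)

cb    = lbgCodebook(trainFeat, 32);   % codebook sizes 32, 64 or 128 in the project
score = mean(min(sqDist(cb, testFeat), [], 1));   % average distortion = matching score

function cb = lbgCodebook(x, cbSize)
% LBG: start from the global centroid, split each codeword, refine with Lloyd iterations
cb = mean(x, 2);
while size(cb, 2) < cbSize
    cb = [cb*1.01, cb*0.99];                       % split each codeword
    for iter = 1:20
        [~, nearest] = min(sqDist(cb, x), [], 1);  % assign vectors to nearest codeword
        for k = 1:size(cb, 2)
            sel = (nearest == k);
            if any(sel)
                cb(:, k) = mean(x(:, sel), 2);     % update centroid
            end
        end
    end
end
end

function d = sqDist(cb, x)
% Squared Euclidean distance d(i,j) = (cb_i - x_j)'(cb_i - x_j)
d = zeros(size(cb, 2), size(x, 2));
for i = 1:size(cb, 2)
    diff    = bsxfun(@minus, x, cb(:, i));
    d(i, :) = sum(diff.^2, 1);
end
end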
5. Experimental Setup

This chapter discusses the setup used to analyse the various techniques and algorithms and to determine their suitability for speaker recognition systems.

Speaker voice data

The speaker voices were recorded in a quiet room using the Windows audio recorder software and a Sonic Gear microphone, with the sampling rate set at 16 kHz. Users were asked to utter the single-digit sequence from '0' to '9' for eighteen repetitions. Fifteen of the samples were combined into one single file for the purpose of training the speaker model; three of the samples were used to test the recognition capability of the different voiceprints in the prototype ASR developed.

Language: English
Speakers: 15 (6 males, 9 females)
Speech type: single digits from '0' to '9'
Recording condition: relatively clean
Sampling frequency: 16 kHz
Training speech duration: approximately 90 seconds
Evaluation speech duration: approximately 5 seconds

Table 5.1 Voiceprint test parameters

The speech pre-processing, feature extraction and speaker modelling used are as described in chapter 4.

Prototype development

The prototype ASR is developed using the Matlab GUI.

6. Results

6.1 Linear Predictive Coefficients

The first method to be evaluated is the LPC-derived voice features. LPC is seldom used by itself for speaker recognition in modern ASR, but in this project it serves as a basis of comparison for the other methods. The results are shown in figures 6.1, 6.2 and 6.3.

Figure 6.1 Results of LPC for codebook size of 32

Figure 6.1 shows that varying the order of LPC has an effect: LPC using 8 coefficients has a better recognition rate than the higher-order LPC for a codebook of size 32. The results, however, are not unexpected. The two most significant factors that affect the recognition results are the quality of the speech signal and the size of the codebook. Increasing the codebook size and the number of LPC coefficients increases the effect of noise on the signal, as the extracted features contain more information in which noise can be present.

Figure 6.2 Results of LPC for codebook size of 64

The results obtained for LPC using a codebook size of 64, shown in figure 6.2, are broadly similar to those using a codebook size of 32: the recognition rate decreases from 66.67% to between roughly 40% and 53.33% as the number of coefficients increases. Figure 6.3 also shows similar recognition for a codebook of size 128 as compared to codebooks of size 64 and 32.

Figure 6.3 Results of LPC for codebook size of 128

6.1.2 Conclusion of LPC

Figure 6.4 Comparison of LPC using different codebook sizes

The results show that the recognition rate of features extracted using LPC ranges from 40% to 66.67%. LPC using fewer coefficients generally has better recognition rates than LPC using more coefficients, and the LPC recognition rates are most consistent when a codebook of size 32 is employed. The possible reasons for the poor recognition rate are, firstly, the insufficient speaker-specific information in LPC and, secondly, its susceptibility to noise within the signal, which causes inaccuracy in the extracted features. The findings show that LPC does not perform well enough to be used for a secure ASR.

6.2 Linear Predictive Cepstral Coefficients

The second method to be evaluated is the LPCC-derived voice features. LPCC is computed from LPC and is one of the most popular features used for speaker recognition in modern ASR. The results are shown in figures 6.5, 6.6 and 6.7.

Figure 6.5 Results of LPCC for codebook size of 32

Figure 6.5 shows that varying the order of LPCC has an effect: the recognition rate increases with the order of LPCC for a codebook of size 32, rising from 73.33% (LPCC8) to 93.33% (LPCC12 and LPCC16) and dropping to 86.67% (LPCC20). This shows that the recognition rate does not always increase with the order of the LPCC; in fact, LPCC experiences a drop in recognition rate when higher-order coefficients are used. This tallies with the study by Reynolds [23], where the LPCC recognition rate averages 90% and drops when higher-order LPCC is used.

Figure 6.6 Results of LPCC for codebook size of 64

The results obtained for LPCC using a codebook size of 64, shown in figure 6.6, are broadly similar to those using a codebook size of 32: the recognition rate increases from 73.33% and then plateaus at 93.33%. This shows that increasing the order of LPCC only helps for low-order coefficients; beyond that, increasing the order has no further effect.

Figure 6.7 Results of LPCC for codebook size of 128

Figure 6.7 also shows similar recognition for a codebook of size 128. The results are similar to LPCC using codebooks of size 32 and 64, with the recognition rate peaking at the 16th order of LPCC.

6.2.1 Conclusion of LPCC

Figure 6.8 Comparison of LPCC using different codebook sizes

The results show that the recognition rate of features extracted using LPCC ranges from 73.33% to 100%. LPCC using fewer coefficients has the worst recognition, and the recognition rate peaks at the 16th order of LPCC for all codebook sizes. The findings show that the order of LPCC affects the recognition rate; however, increasing the order beyond 16 increases the computation time and causes the recognition to drop. This is expected, as LPCC is less efficient when higher-order coefficients are used: superfluous information or noise is included in the modelling of the speaker, resulting in lower recognition rates. The results also show LPCC to be more robust and accurate than LPC.

6.3 Mel-Frequency Cepstral Coefficients

The third method to be evaluated is the MFCC-derived voice features. MFCC are coefficients that represent sound based on human perception, and they are also among the most popular features used for speaker recognition in modern ASR. MFCC are derived by taking the Fourier transform of the signal, warping it with a Mel filter bank that closely mimics the Mel scale, and finally performing a Discrete Cosine Transform on the logarithm of the power output of the Mel-scale filters. The results are shown in figures 6.9, 6.10 and 6.11.

Figure 6.9 Results of MFCC for codebook size of 32

Figure 6.9 shows that varying the order of the MFCC does not have much effect on the recognition rate: it increases from 80% (MFCC8) and stays at 93.33% (MFCC12, MFCC16 and MFCC20). The results for a codebook size of 32 show that MFCC functions better than LPCC and LPC when smaller codebooks are used. This might be because MFCC is more immune to the noise that affects LPC and LPCC.

Figure 6.10 Results of MFCC for codebook size of 64

The results in figure 6.10 show a recognition rate of 93.33% across all the orders of MFCC used; varying the order of the MFCC neither increases nor decreases the recognition rate.

Figure 6.11 Results of MFCC for codebook size of 128

Figure 6.11 also shows similar recognition for a codebook of size 128. The results are similar to MFCC using a codebook of size 32, with the recognition rate peaking at 93.33%.

6.3.1 Conclusion of MFCC

Figure 6.12 Comparison of MFCC using different codebook sizes

The results show the recognition rate of MFCC ranging from 80% to 100%. MFCC using fewer coefficients (MFCC8) has the worst recognition, and the recognition rate peaks at the 12th order of MFCC for all codebook sizes. The findings show that the order of MFCC affects the recognition rate; however, increasing the order beyond 12 brings no further benefit to the recognition rate of the system. It is also observed that MFCC using a codebook size of 64 is the most consistent. The codebook size determines the number of feature vectors stored for comparison, and the findings show that a codebook size of 64 gives the best results for speaker recognition.

Overall, the MFCC recognition rate is better than that of LPC and LPCC. It did not achieve 100% for speaker recognition, but the recognition rates are more consistent than those of LPC and LPCC: the worst recognition rate was 80% for MFCC8 and the best was 93.33% for MFCC12, MFCC16 and MFCC20. The findings are consistent with MFCC being known to be more robust to noise and spectral estimation errors when higher-order coefficients are used (recognition rates are maintained for orders above 12).
6.4 LPCC+MFCC

Based on the results obtained for the individual voice features, this method aims to combine the two features LPCC and MFCC to achieve a better recognition rate by considering supplementary information sources. This is accomplished using output fusion, which models each feature set separately and combines them at the output to give an overall matching score. Figure 6.13 shows the structure of the proposed system.

Figure 6.13 Block diagram of the proposed system (LPCC+MFCC): the speech signal is compared with the speaker models built from LPCC16 and from MFCC16 (both with codebook size 32); the two sets of output matching scores are weighted and combined, and the decision logic returns the user name and distance.

The speech signal of the unknown speaker is processed individually using LPCC (16th order, codebook size 32) and MFCC (16th order, codebook size 32) and compared with the corresponding codebooks of the known-speaker database. The choice of codebook size and order is based on the following reasons:
1. The size of the codebook determines the complexity of the computation, and based on the results achieved, a codebook size of 32 already reaches 93.33% for both LPCC16 and MFCC16.
2. The extra computational time required for implementing such a system is negligible compared with other methods when running on a typical home PC with an Intel Core 2 Duo processor.

The corresponding matching scores, which indicate the degree of similarity between the users, are generated and combined. The weighting allocated is equal, because the individual results show that LPCC16 and MFCC16 achieve the same recognition rate. The combined score is calculated as

$d_{combined}(i, j) = w_{1}\, d_{LPCC}(i, j) + w_{2}\, d_{MFCC}(i, j)$, with $w_{1} = w_{2}$

where $i$ is the current unknown speaker and $j$ indexes the known speakers residing in the database. The user with the lowest combined score (highest degree of similarity) is returned as the identity of the unknown speaker.

6.4.2 Comparison of LPCC+MFCC vs other methods

Figure 6.14 Overview of recognition rates using different codebook sizes

Looking at figure 6.14, LPCC+MFCC managed to achieve a 100% recognition rate. It achieved a better recognition rate than the individual LPCC16 and MFCC16 using a codebook size of 32, and the extra computational time required is negligible on a modern desktop PC. Overall, MFCC performed slightly more consistently than LPCC: varying the order and codebook size of MFCC does not cause the recognition rate to fluctuate as much as it does for LPCC. LPC performed the worst of all the methods tested, which is not surprising, as LPC does not contain enough speaker-specific features and is much more prone to noise than LPCC and MFCC.
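A minimal sketch of the equal-weight score fusion and decision logic described in Section 6.4 is shown below. The per-speaker LPCC and MFCC scores are assumed to be the average-distortion matching scores of Section 4.6; here they are filled with placeholder values so the fragment runs on its own, and the variable names are illustrative.

% LPCC+MFCC output fusion sketch: combine scores and pick the closest speaker
nSpeakers = 15;
scoreLPCC = rand(nSpeakers, 1);     % placeholder: d_LPCC(i, j) for each known speaker j
scoreMFCC = rand(nSpeakers, 1);     % placeholder: d_MFCC(i, j) for each known speaker j

wLPCC = 0.5;  wMFCC = 0.5;          % equal weighting, as in the report
combined = wLPCC * scoreLPCC + wMFCC * scoreMFCC;

[bestScore, bestSpeaker] = min(combined);   % lowest score = highest similarity
fprintf('Identified speaker %d (combined distance %.3f)\n', bestSpeaker, bestScore);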
7. Conclusion

This thesis has presented an analysis of voiceprints for speaker recognition. The various pre-processing stages prior to feature extraction were studied and implemented for the prototype ASR. The prototype was developed to analyse and evaluate voice feature extraction methods such as LPC, LPCC and MFCC for their suitability in ASR. In addition, a new method (LPCC16+MFCC16) was proposed to enhance the recognition rate of the ASR by using output fusion.

The results obtained show that LPCC and MFCC perform relatively well in speaker recognition tasks. LPCC using an order of 16 with a codebook size of 128 achieved the best recognition rate of 100%; however, utilising a codebook of size 128 requires considerably more computation, which affects the performance of the system, and LPCC also performs poorly when an insufficient order is used. MFCC is more consistent than LPCC in performing the recognition task, as it is less susceptible to noise and is modelled after the human perception of sound. An evaluation of the fusion method using LPCC16 and MFCC16 achieved 100% accuracy on a group of 15 speakers. The result indicates that by using multiple feature sets it is possible to achieve a high recognition rate with smaller codebooks.

7.1 Recommendations for further study

Usage of other sources of information: combining short-time spectral analysis (10-30 ms) with behavioural analysis such as accent, word duration and speaking speed for speaker identification.

Robustness of LPCC+MFCC: the reliability of speaker recognition systems drops drastically in noisy environments, so an evaluation of the robustness of LPCC+MFCC could be conducted.

Evaluate LPCC+MFCC using telephone speech: the effectiveness of the method can be further evaluated using telephone speech.

8. Critical review and reflections

"The road of life twists and turns and no two directions are ever the same. Yet our lessons come from the journey, not the destination." This quote by American novelist Don Williams Jr. quite rightly sums up my feelings about the journey taken for my Final Year Project, from the selection of a thesis topic that seemed simple to the realisation of the daunting tasks that had to be completed to finish the subject. The lessons I gained were not just academic knowledge but also life lessons such as time management, discipline and perseverance.

The main tasks of the project were divided into smaller sections to make the daunting task more manageable. The first phase involved long periods of time in libraries and on the Internet researching topics related to speaker recognition. Much time and effort went into gaining a better understanding of the field of speaker recognition, in which I had no prior knowledge. The many consultation sessions held with my project supervisor, Dr Yuan, to clear up queries about the concepts and theory of speaker recognition technology were invaluable and critical for laying my foundation knowledge in this field.

The second phase involved putting the knowledge gained from the research into practical implementation. The prototype ASR was developed using Matlab to evaluate the different voiceprints. Many issues and challenges were encountered during this phase: the lack of practical knowledge of Matlab was crippling at times and hindered the overall progress of the project, and much time was spent searching the Internet and Matlab help forums for ways to implement certain functions. The sense of euphoria when a Matlab function that I had coded finally worked the way I wanted, after hours of testing and debugging, made me realise that programming can be fun and exciting too.

The third phase involved the actual testing and evaluation of the different voiceprints. The initial ASR had a recognition rate of 70%, which was a far cry from the 90% indicated in various research papers.
Dr Yuan pointed out that my speaker models were being built with insufficient speaker voice samples, and this needed to be changed in order to achieve better recognition rates. After rectifying the problem, the ASR managed to achieve results close to the reported standards. After the evaluation of the different voiceprints, the idea of merging two feature extraction methods (LPCC+MFCC) for use in the ASR arose in one of the sessions with Dr Yuan. As always, the idea is simple but the implementation is hard: the comparison distance results of the LPCC and MFCC differ, and setting the weighting for LPCC+MFCC required analysis of the results for the different orders of LPCC and MFCC. Extra effort was needed to attain the goal.

All in all, the FYP to me is not just about attaining the project objectives; the most important thing is the process of self-actualisation, from a seemingly uphill task to climbing the "big" mountain step by step until the actual completion of the project. The process is much more than just knowledge. Looking back, I am glad that I embarked on this journey, which gave me a chance to hone my analytical, problem-solving and critical thinking skills, and this can only benefit me in my life and future career.

9. References

1. Reynolds, D.A. "An overview of automatic speaker recognition technology". Acoustics, Speech, and Signal Processing, 2002 (ICASSP '02), IEEE International Conference on, vol. 4, pp. IV-4072-IV-4075, 2002.
2. Ayaz Keerio, Bhargav Kumar Mitra, Philip Birch, Rupert Young and Chris Chatwin. "On Preprocessing of Speech Signals". International Journal of Signal Processing, vol. 5, no. 3, 2009, p. 216.
3. Campbell, J.P. "Speaker Recognition". Technical report, Department of Defense, Fort Meade, 1999.
4. Al-Akaidi, Marwan. "Introduction to speech processing". In Fractal Speech Processing. Cambridge University Press, The Edinburgh Building, Cambridge CB2 2RU, UK, 2004.
5. Saha, G., Chakroborty, S. and Senapati, S. "A new Silence Removal and End Point Detection Algorithm for Speech and Speaker Recognition Applications". In Proc. of the Eleventh National Conference on Communications (NCC), IIT Kharagpur, India, January 28-30, 20.
6. Greenwood, M. and Kinghorn, A. "SUVing: automatic silence/unvoiced/voiced classification of speech". Department of Computer Science, The University of Sheffield, 1999.
7. Long, Hai-Nan and Cui-Gai Zhang. "An improved method for robust speech endpoint detection". Machine Learning and Cybernetics, 2009 International Conference on, vol. 4, pp. 2067-2071, 12-15 July 2009.
8. Liu, Li, He, Jialong and Palm, G. "Signal modeling for speaker identification". Acoustics, Speech, and Signal Processing, 1996 (ICASSP-96), IEEE International Conference on, vol. 2, pp. 665-668, 7-10 May 1996.
9. Picone, J.W. "Signal modeling techniques in speech recognition". Proceedings of the IEEE, vol. 81, no. 9, pp. 1215-1247, Sep 1993.
10. Jayanna, H.S. and Mahadeva Prasanna, S.R. "Analysis, Feature Extraction, Modeling and Testing Techniques for Speaker Recognition". IETE Tech Rev 2009;26:181-90.
11. Schroeder, M. "Linear prediction, entropy and signal analysis". ASSP Magazine, IEEE, vol. 1, no. 3, pp. 3-11, Jul 1984.
12. Schroeder, M. "Linear predictive coding of speech: Review and current directions". Communications Magazine, IEEE, vol. 23, no. 8, pp. 54-61, Aug 1985.
13. Kwong, S. and Nui, P.T. "Design and implementation of a parametric speech coder". Consumer Electronics, IEEE Transactions on, vol. 44, no. 1, pp. 163-169, Feb 1998.
"Design and implementation of a parametric speech coder,". Consumer Electronics, IEEE Transactions on , vol.44, no.1, pp.163-169, Feb 1998. 14. Gupta, V., Bryan, J. and Gowdy, J.,. "A speaker-independent speech-recognition system based on linear prediction,". Acoustics, Speech and Signal Processing, IEEE Transactions on , vol.26, no.1, pp. 27-33, Feb 1978. ENG499 CAPSTONE PROJECT REPORT 57 15. Atal, B. "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification.". J. Acoust. Soc. Am. 55 (6), 1304-1312. 16. Wong, E. and Sridharan, S. "Comparison of linear prediction cepstrum coefficients and mel-frequency cepstrum coefficients for language identification,. "Intelligent Multimedia, Video and Speech Processing, 2001. Proceedings of 2001 International Symposium on, vol., no., pp.95-98, 2001. 2001. 17. Md. Rashidul Hasan, Mustafa Jamil, Md. Golam Rabbani, Md. Saifur Rahman. "Speaker Identification using Mel Frequency cepstral coefficients". 3rd International Conference on Electrical & Computer Engineering ICECE 2004, 28-30 December 2004, Dhaka, Bangladesh. 18. Vergin, R.,. "An algorithm for robust signal modelling in speech recognition," . Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on , vol.2, no., pp.969-972 vol.2, 12-15 May 1998. 19. Vergin, R., O'Shaughnessy, D. and Farhat, A.,. "Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition," . Speech and Audio Processing, IEEE Transactions on , vol.7, no.5, pp.525-532, Sep 1999. 20. Buchanan, C.R. "Informatics Reseach Proposal - Modeling the Semantics of sound". 2005. 21. Seddik, H., Rahmouni, A. and Sayadi, M.,. "Text independent speaker recognition using the Mel frequency cepstral coefficients and a neural network classifier". , Communications and Signal Processing, 2004. First International Symposium on , vol., no., pp. 631-634, 2004. 2004. 22. Molau, S., et al. "Computing Mel-frequency cepstral coefficients on the power spectrum ," . Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP '01). 2001 IEEE International Conference on , vol.1, no., pp.73-76 vol. 23. Reynolds, D.A.,. "Experimental evaluation of features for robust speaker identification," . Speech and Audio Processing, IEEE Transactions on , vol.2, no.4, pp.639-643, Oct 1994. 24. Zhonghua, Fu and Zhao Rongchun. "An overview of modeling technology of speaker recognition," . Neural Networks and Signal Processing, 2003. Proceedings of the 2003 International Conference on , vol.2, no., pp. 887-891 Vol.2, 14-17 Dec. 2003. 25. Y. Linde, A. Buzo, and R.M. Gray,. "An algorithm for vector quantizer design,". IEEE Trans. Communications, vol. COM-28(1), pp. 84-96, Jan. 1980. 26. R. Gray. "Vector quantization,". IEEE Acoust., Speech, Signal Process. Mag., vol. 1, pp. 4-29, Apr. 1984. 27. F.K. Soong, A.E. Rosenberg, L.R. Rabiner, and B.H. Juang,. "A Vector quantization approach to speaker recognition,". in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 10, Detroit, Michingon, Apr. 1985, pp. 387-90. ENG499 CAPSTONE PROJECT REPORT 58 28. Sookpotharom Supot*, Reruang Sutat**, Airphaiboon Surapan**, and Sangworasil Manas**. Medical Image Compression Using Vector Quantization and Fuzzy C-Means. [Online] http://www.kmitl.ac.th/biolab/member/sutath/final_paper_iscit02.pdf. 29. Mohamed, M.A. and Gader, P.,. "Generalized hidden Markov models. I. Theoretical frameworks,". 
30. Rabiner, L.R. "A tutorial on hidden Markov models and selected applications in speech recognition". Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, Feb 1989.
31. Ashish Jain and John Harris. "Speaker Identification using MFCC and HMM based techniques". MIL/CNEL, University of Florida, FL, April 25, 2004.
32. Pawar, R.V., Kajave, P.P. and Mali, S.N. "Speaker Identification using Neural Networks". World Academy of Science, Engineering and Technology, 12, 2005.
33. Nikos Fakotakis, Kallirroi Georgila and Anastasios Tsopanoglou. "A continuous HMM text-independent speaker recognition system based on vowel spotting". In EUROSPEECH '97, 1997, vol. 5, pp. 2247-2250.
34. Poonam Bansal, Amita Dev and Shail Bala Jain. "Automatic Speaker Identification Using Vector Quantization". Asian Journal of Information Technology 6 (9): 938-942, ISSN 1682-3915, Medwell Journals, 2007. Amity School of Engineering and Technology, 580 Delhi Palam Vihar Road.
35. Wutiwiwatchai, C., Sae-tang, S. and Tanprasert, C. "Text-dependent Speaker Identification Using Neural Network on Distinctive Thai Tone Marks". Proceedings of the International Joint Conference on Neural Networks, July 1999.
36. Biswas, S., Ahmad, S. and Islam Mollat, M.K. "Speaker Identification Using Cepstral Based Features and Discrete Hidden Markov Model". Information and Communication Technology, 2007 (ICICT '07), International Conference on, pp. 303-306, 7-9 March 2007.
37. Lee, K.-F. and Hon, H.-W. "Speaker-independent phone recognition using hidden Markov models". Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 37, no. 11, pp. 1641-1648, Nov 1989.

Appendix

A.1 Identification rates using codebook size of 32
A.2 Identification rates using codebook size of 64
A.3 Identification rates using codebook size of 128
A.4 Identification rates using MFCC+LPCC (codebook size 32)

Figure A.3 FYP.fig
Figure A.4 Feature_selection.fig
Figure A.5 User_Identified.fig
Figure A.6 Voice_record_FYP.fig