BABU BANARSI DAS INSTITUTE OF TECHNOLOGY, GHAZIABAD
(AFFILIATED TO U.P. TECHNICAL UNIVERSITY, LUCKNOW)
(Established in 2000)

"REAL TIME SPEAKER RECOGNITION"

A project report submitted in partial fulfillment of the requirement for the award of the degree of Bachelor of Technology in Electronics and Communication Engineering
Academic session 2008-2012

Submitted by:
Rohit Singh (0803531072)
Harish Kumar (0803531409)
Samreen Zehra (0803531074)

Project guides:
Mr. Prashant Sharma
Mr. Anil Kumar Pandey

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
MAY 2012

CERTIFICATE

This is to certify that Rohit Singh, Harish Kumar and Samreen Zehra of the Department of Electronics and Communication Engineering of this institute have jointly carried out the project work presented in this report, entitled "Real Time Speaker Recognition", in partial fulfillment of the award of the degree of Bachelor of Technology in Electronics and Communication Engineering from UPTU, Lucknow, under our supervision. The report embodies the results of work and studies carried out by the students themselves, and the content of the report does not form part of any other degree awarded to these candidates or to anybody else.

Project guides:
Mr. Prashant Sharma (AP, ECE)
Mr. Anil Kumar Pandey (AP, ECE)

ACKNOWLEDGEMENT

We have put considerable effort into this project. However, it would not have been possible without the kind support and help of many individuals, and we would like to extend our sincere thanks to all of them.

We are highly indebted to Mr. K. L. Pursnani, Head of the Department of Electronics and Communication Engineering, BBDIT, Ghaziabad, for his guidance and constant supervision, for providing the necessary information regarding the project, and for his support in completing it.

We would like to express our special gratitude and thanks to our project guides, Mr. Prashant Sharma and Mr. Anil Kumar Pandey, for giving us so much of their attention and time. Both of our guides took great pains to help us in the best possible way; without them this project would never have been realized.

We would also like to express our gratitude to our parents and friends for their kind cooperation and encouragement, which helped us in the completion of this project. Our thanks and appreciation also go to our college and to everyone who willingly helped us with their abilities.

Rohit Singh (0803531072)
Harish Kumar (0803531409)
Samreen Zehra (0803531074)

CONTENTS

1. Introduction ......................................................... 7
   1.1 Project overview ................................................. 7
   1.2 Application ..................................................... 10
2. Methodology ......................................................... 11
   2.1 Algorithm ....................................................... 11
   2.2 Flow Chart ...................................................... 12
3. Identification background ........................................... 13
   3.1 DSP Fundamentals ................................................ 13
       3.1.1 Basic Definitions ......................................... 13
       3.1.2 Convolution ............................................... 14
       3.1.3 Discrete Fourier Transform ................................ 14
   3.2 Human Speech Production Model ................................... 15
       3.2.1 Anatomy ................................................... 15
       3.2.2 Vocal Model ............................................... 17
   3.3 Speaker Identification .......................................... 18
4. Speech Feature Extraction ........................................... 21
   4.1 Introduction .................................................... 21
   4.2 Short term analysis ............................................. 22
   4.3 Cepstrum ........................................................ 23
       4.3.1 Delta Cepstrum ............................................ 24
   4.4 Mel-Frequency Cepstrum Coefficients ............................. 25
       4.4.1 Computing of mel-cepstrum coefficients .................... 27
   4.5 Framing and Windowing ........................................... 29
5. Vector Quantization ................................................. 30
6. K-means ............................................................. 33
   6.1 Mean shift clustering ........................................... 35
   6.2 Bilateral filtering ............................................. 36
   6.3 Speaker matching ................................................ 36
7. Euclidean distance .................................................. 38
   7.1 One dimension ................................................... 40
   7.2 Two dimensions .................................................. 40
   7.3 Three dimensions ................................................ 40
   7.4 N dimensions .................................................... 40
   7.5 Squared Euclidean distance ...................................... 40
   7.6 Decision making ................................................. 41
8. Result and Conclusion ............................................... 42
   8.1 Result .......................................................... 42
   8.2 Pruning Basis ................................................... 42
       8.2.1 Static Pruning ............................................ 43
       8.2.2 Adaptive Pruning .......................................... 44
   8.3 Conclusion ...................................................... 44

LIST OF FIGURES

Name of Figure                                                     Page No.
1.1 Identification Taxonomy ............................................. 8
3.1 Vocal Tract Model .................................................. 16
3.2 Multi Tube Lossless Model .......................................... 17
3.3 Source Filter Model ................................................ 18
3.4 Enrollment Phase ................................................... 19
3.5 Identification Phase ............................................... 20
4.1 Short Term Analysis ................................................ 23
4.2 Speech Magnitude Spectrum .......................................... 25
4.3 Cepstrum ........................................................... 25
4.4 Computing of mel-cepstrum coefficients ............................. 27
4.5 Triangular Filter used to compute mel-cepstrum ..................... 28
4.6 Quasi Stationary ................................................... 29
5.1 Vector Quantization of Two Speakers ................................ 31
7.1 Decision Process ................................................... 41
8.1 Distance stabilization ............................................. 42
8.2 Evaluation of static variant using different pruning intervals .... 43
8.3 Examples of the matching score distribution ........................ 44

CHAPTER 1
INTRODUCTION

1.1 Project overview

Human speech conveys different types of information. The primary type is the meaning, or words, that the speaker tries to pass to the listener, but speech also carries information about the language being spoken and about the emotions, gender and identity of the speaker. The goal of automatic speaker recognition is to extract, characterize and recognize the information about speaker identity.

Speaker recognition is usually divided into two different branches:
- Speaker verification
- Speaker identification

The speaker verification task is to verify the claimed identity of a person from his voice. This process involves only a binary decision about the claimed identity. In speaker identification there is no identity claim, and the system decides who the speaking person is. Speaker identification can be further divided into two branches. Open-set speaker identification decides to which of the registered speakers an unknown speech sample belongs, or concludes that the speech sample comes from an unknown speaker.
In this work we deal with closed-set speaker identification, which is a decision-making process that determines which of the registered speakers is most likely the author of the unknown speech sample. Depending on the algorithm used for the identification, the task can also be divided into text-dependent and text-independent identification. The difference is that in the first case the system knows the text spoken by the person, while in the second case the system must be able to recognize the speaker from any text. This taxonomy is represented in Figure 1.1.

Fig. 1.1 Identification Taxonomy (speaker recognition splits into speaker verification and speaker identification; identification is further divided into closed-set vs. open-set and text-dependent vs. text-independent identification)

Speaker recognition is thus basically divided into two classes, speaker verification and speaker identification, and it is the method of automatically identifying who is speaking on the basis of individual information embedded in the speech waves. Speaker recognition is widely applicable wherever a speaker's voice is used to verify identity and control access to services such as banking by telephone, database access services, voice dialing, telephone shopping, information services, voice mail, security control for confidential information areas, and remote access to computers. AT&T and TI, together with Sprint, have started field tests and actual applications of speaker recognition technology; Sprint's Voice Phone Card is already being used by many customers.

Speaker recognition is one of the most promising technologies for creating new services that will make our everyday lives more secure. Another important application of speaker recognition technology is for forensic purposes. Speaker recognition has been an appealing research field for the last few decades, and it still poses a number of unsolved problems.

The main aim of this project is speaker identification, which consists of comparing a speech signal from an unknown speaker with a database of known speakers. The system, having been trained with a number of speakers, can then recognize the speaker. Figure 1.1 also shows the fundamental structure of speaker identification and verification systems: speaker identification is the process of determining which registered speaker produced a given utterance, while speaker verification is the process of accepting or rejecting the identity claim of a speaker. Most applications in which the voice is used as a key to confirm the identity of a speaker are classified as speaker verification.

This formulation can be modified by adding the open-set identification case, in which a reference model for the unknown speaker may not exist. This is usually the case in forensic applications. In such circumstances an additional decision alternative, "the unknown speaker does not match any of the models", is required. A threshold test can also be used in both the verification and the identification process to decide whether the match is close enough to accept the decision or whether more speech data are needed.

Speaker recognition methods can also be divided into text-dependent and text-independent methods. In a text-dependent method the speaker has to say key words or sentences with the same text for both the training and the recognition trials, whereas a text-independent method does not rely on a specific text being spoken. Formerly, text-dependent methods were widely used, but text-independent methods are now more common.
Both text-dependent and text-independent methods share a problem, however: such systems can easily be deceived by playing back the recorded voice of a registered speaker. Different techniques are used to cope with this problem. For example, a small set of words or digits is used as input, and each user is prompted to utter a specified sequence of key words that is randomly selected every time the system is used. Still, this method is not completely reliable, since it can be deceived by advanced electronic recording equipment that can replay the secret key words in the requested order.

1.2 Application

Practical applications of automatic speaker identification are obviously various kinds of security systems. The human voice can serve as a key to any secured object, and in general it is not easy to lose or forget it. Another important property of speech is that it can be transmitted over a telephone channel, for example. This makes it possible to identify speakers automatically and grant access to secured objects by telephone; nowadays this approach is beginning to be used for telephone credit card purchases and bank transactions. The voice can also be used to prove identity when accessing physical facilities, by storing the speaker model in a small chip that is used as an access tag instead of a PIN code.

Another important application of speaker identification is monitoring people by their voices. For instance, it is useful in information retrieval for speaker indexing of recorded debates or news broadcasts, so that speech can be retrieved only for the speakers of interest. It can also be used to monitor criminals in public places by identifying them by their voices. In fact, all of these are examples of real-time systems. For any identification system to be useful in practice, the response time, i.e. the time spent on the identification, should be minimized; the growing size of the speaker database is the major limitation here.

CHAPTER 2
METHODOLOGY

2.1 Algorithm

1. Record the test sound through a microphone.
2. Convert the recorded sound into .wav format.
3. Load the recorded sound files from the database.
4. Extract features from the test file for recognition.
5. Extract features from all the files stored in the database.
6. Find the centroids of the test file using a clustering algorithm.
7. Find the centroids of the sample files stored in the database so that codebooks can be generated.
8. Calculate the Euclidean distance between the test file and each individual sample in the database.
9. Find the sample having the minimum distance to the test file.
10. The sample corresponding to the minimum distance is most likely the author of the test sound.
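To make the flow of these steps concrete, the sketch below outlines the same pipeline in Python on synthetic data. It is an illustration only, not the project's implementation (which was built around the Voicebox melcepst routine); the random vectors stand in for real MFCC features, and the codebook size of 16 is an arbitrary toy value.

```python
# Illustrative outline of the 10-step pipeline on synthetic feature vectors.
# Real MFCC extraction (steps 1-5) is covered in Chapter 4; random vectors
# stand in for it here so the example runs on its own.
import numpy as np
from scipy.cluster.vq import kmeans2, vq

rng = np.random.default_rng(0)

# Steps 5-7: pretend MFCC vectors for two enrolled speakers -> small codebooks
database = {"speaker_A": rng.normal(0.0, 1.0, size=(500, 12)),
            "speaker_B": rng.normal(2.5, 1.0, size=(500, 12))}
codebooks = {name: kmeans2(feats, 16, minit='points')[0]
             for name, feats in database.items()}

# Steps 1-4: pretend MFCC vectors extracted from the unknown test recording
test_vectors = rng.normal(2.5, 1.0, size=(200, 12))

# Steps 8-10: average Euclidean distortion against every codebook, pick the minimum
scores = {name: vq(test_vectors, cb)[1].mean() for name, cb in codebooks.items()}
best = min(scores, key=scores.get)
print(scores, "->", best)   # the smallest score identifies speaker_B
```

The same structure reappears in the following chapters: feature extraction (Chapter 4), codebook generation by clustering (Chapters 5 and 6), and Euclidean-distance matching and decision making (Chapter 7).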
2.2 Flowchart

CHAPTER 3
IDENTIFICATION BACKGROUND

In this chapter we discuss the theoretical background for speaker identification. We start from digital signal processing theory, then move to the anatomy of the human voice production organs and discuss the basic properties of the human speech production mechanism and techniques for its modeling. This model will be used in the next chapter, where we discuss techniques for extracting speaker characteristics from the speech signal.

3.1 DSP Fundamentals

As its name suggests, Digital Signal Processing (DSP) is the part of computer science that operates on a special kind of data: signals. In most cases these signals are obtained from various sensors, such as a microphone or a camera. DSP is the mathematics, mixed with algorithms and special techniques, used to manipulate these signals once they have been converted to digital form.

3.1.1 Basic Definitions

By a signal we mean a relation describing how one parameter depends on another parameter. One of these parameters is called the independent parameter (usually it is time), and the other is called dependent and represents what we are measuring. If both of these parameters take values from continuous ranges, we call such a signal a continuous signal. When a continuous signal is passed through an analog-to-digital converter (ADC), it becomes a discrete, or digitized, signal. The conversion works in the following way: at regular time intervals, occurring at a rate called the sampling frequency, the signal value is taken and quantized by selecting an appropriate value from a range of possible values. This range is called the quantization precision and is usually expressed as the number of bits available to store a signal value. Based on the Nyquist sampling theorem, a digital signal can contain frequency components only up to one half of the sampling rate. Generally, continuous signals are what we have in nature, while discrete signals exist mostly inside computers. Signals that use time as the independent parameter are said to be in the time domain, while signals that use frequency as the independent parameter are said to be in the frequency domain. One of the important definitions used in DSP is that of a linear system. By a system we mean any process that produces an output signal in response to a given input signal. A system is called linear if it satisfies three properties: homogeneity, additivity and shift invariance.

3.1.2 Convolution

An impulse is a signal composed of all zeros except one non-zero point. Every signal can be decomposed into a group of impulses, each of which is then passed through a linear system, and the resulting output components are synthesized, or added together. The resulting signal is exactly the same as that obtained by passing the original signal through the system. Every impulse can be represented as a shifted and scaled delta function, which is a normalized impulse: sample number zero has a value of one and all other samples have a value of zero. When the delta function is passed through a linear system, its output is called the impulse response. If two systems are different, they will have different impulse responses. Scaling and shifting the input produces an identically scaled and shifted impulse response at the output. This means that knowing a system's impulse response, we know everything about the system.

Convolution is the formal mathematical operation used to describe the relationship between three signals of interest: the input signal, the output signal, and the impulse response of the system. It is usually said that the output signal is the input signal convolved with the system's impulse response. The mathematics behind convolution does not restrict how long the impulse response is; it only says that the length of the output signal is the length of the input signal plus the length of the impulse response minus one. Convolution is a very important concept in DSP: based on the properties of linear systems, it provides the way of combining two signals to form a third signal, and a lot of the mathematics behind DSP is based on it.
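These properties are easy to verify numerically. The short NumPy example below (illustrative only, not part of the original project) convolves a signal with an arbitrary impulse response and checks both the output-length rule and the superposition view described above.

```python
import numpy as np

# Impulse response of some linear system (here just an arbitrary 4-sample filter)
h = np.array([1.0, 0.5, 0.25, 0.125])

# An input signal of length 8
x = np.array([0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])

# Output = input convolved with the impulse response
y = np.convolve(x, h)
print(len(y))            # 8 + 4 - 1 = 11 samples, as stated above

# The same output can be built by decomposing x into scaled, shifted impulses,
# passing each through the system, and adding the responses (superposition).
y_superposition = np.zeros(len(x) + len(h) - 1)
for n, xn in enumerate(x):
    y_superposition[n:n + len(h)] += xn * h
print(np.allclose(y, y_superposition))   # True
```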
3.1.3 Discrete Fourier Transform

The Fourier transform belongs to the family of linear transforms widely used in DSP and is based on decomposing a signal into sinusoids (sine and cosine waves). Usually in DSP we use the Discrete Fourier Transform (DFT), a special kind of Fourier transform used to deal with periodic discrete signals. Actually, there are an infinite number of ways in which a signal can be decomposed, but sinusoids are selected because of their sinusoidal fidelity: a sinusoidal input to a linear system produces a sinusoidal output in which only the amplitude and phase may change, while the frequency and shape remain the same.

3.2 Human Speech Production Model

Undoubtedly, the ability to speak is the most important way for humans to communicate with each other. Speech conveys various kinds of information: essentially the meaning the speaking person wants to impart, individual information representing the speaker, and also some emotional content. Speech production begins with the initial formulation of the idea that the speaker wants to impart to the listener. The speaker then converts this idea into an appropriate sequence of words and phrases according to the language. Finally, the brain produces motor nerve commands, which move the vocal organs in an appropriate way. Understanding how humans produce sounds forms the basis of speaker identification.

3.2.1 Anatomy

Sound is an acoustic pressure wave formed of compressions and rarefactions of air molecules that originate from movements of human anatomical structures. The most important components of the human speech production system are the lungs (the source of air during speech), the trachea (windpipe), the larynx, and in particular its most important part, the vocal cords (the organ of voice production), the nasal cavity (nose), the soft palate or velum (which allows passage of air through the nasal cavity), the hard palate (which enables consonant articulation), the tongue, the teeth and the lips. All these components, called articulators by speech scientists, move to different positions to produce various sounds. Based on their production, speech sounds can be divided into consonants and voiced and unvoiced vowels.
From a technical point of view, it is more useful to think about the speech production system in terms of acoustic filtering operations that affect the air coming from the lungs. Three main cavities comprise the main acoustic filter: the nasal, oral and pharyngeal cavities. The articulators are responsible for changing the properties of the system and forming its output. The combination of these cavities and articulators is called the vocal tract. Its simplified acoustic model is represented in Figure 3.1.

Fig. 3.1 Vocal Tract Model

Speech production can be divided into three stages: the first stage is sound source production, the second stage is articulation by the vocal tract, and the third stage is sound radiation, or propagation, from the lips and/or nostrils. A voiced sound is generated by the vibratory motion of the vocal cords, powered by the airflow generated by expiration. The frequency of oscillation of the vocal cords is called the fundamental frequency. The other type of sound, an unvoiced sound, is produced by turbulent airflow passing through a narrow constriction in the vocal tract.

In a speaker recognition task we are interested in the physical properties of the human vocal tract. In general it is assumed that the vocal tract carries most of the speaker-related information; however, all parts of the human vocal tract described above can serve as speaker-dependent characteristics, starting from the size and power of the lungs and the length and flexibility of the trachea, and ending with the size, shape and other physical characteristics of the tongue, teeth and lips. Such characteristics are called physical distinguishing factors. Other aspects of speech production that can be useful in discriminating between speakers are called learned factors; these include speaking rate, dialect, and prosodic effects.

3.2.2 Vocal Model

In order to develop an automatic speaker identification system, we should construct a reasonable model of the human speech production system. Having such a model, we can extract its properties from the signal and use them to decide whether or not two signals belong to the same model and, as a result, to the same speaker. The modeling process is usually divided into two parts: the excitation (or source) modeling and the vocal tract modeling. This approach is based on the assumption that the source and the vocal tract models are independent.

Let us look first at the continuous-time vocal tract model called the multi-tube lossless model, which is based on the fact that the production of speech is characterized by a changing vocal tract shape. Because the formalization of such a time-varying vocal tract shape model is quite complex, in practice it is simplified to a series of concatenated lossless acoustic tubes with varying cross-sectional areas, as shown in Figure 3.2. This model consists of a sequence of tubes with cross-sectional areas Ak and lengths Lk. In practice the lengths of the tubes are assumed to be equal. If a large number of short tubes is used, we can approach a continuously varying cross-sectional area, but at the cost of a more complex model. The tube model serves as a transition to the more general discrete-time model, also known as the source-filter model, which is shown in Figure 3.3.

Fig. 3.2 Multi Tube Lossless Model

In this model, the voice source is either a periodic pulse stream or uncorrelated white noise, or a combination of these. This assumption is based on the evidence from human anatomy that all the sounds humans can produce fall into three general categories: voiced, unvoiced, and a combination of the two. Voiced signals can be modeled as a fundamental frequency signal filtered by the vocal tract, and unvoiced signals as white noise, also filtered by the vocal tract. Here E(z) represents the excitation function, H(z) represents the transfer function of the vocal tract, and s(n) is the output of the whole speech production system, so that S(z) = E(z)H(z) [8]. Finally, we can think of the vocal tract as a digital filter that shapes the source signal, and of the produced sound as the filter output. Then, based on digital filter theory, we can extract the parameters of the system from its output.

Fig. 3.3 Source Filter Model
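As a rough, purely illustrative rendering of this source-filter view (the formant frequencies and bandwidths below are arbitrary choices, not values from the project), a voiced sound can be approximated by filtering a periodic pulse train e(n) through a simple all-pole "vocal tract" filter H(z):

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                        # sampling rate in Hz
f0 = 120                         # fundamental frequency of the excitation

# Excitation e(n): a periodic impulse train, as for a voiced sound
e = np.zeros(fs)                 # one second of samples
e[::fs // f0] = 1.0

# Very crude vocal tract H(z) = 1 / A(z): two resonances (formants)
poles = []
for freq, bw in [(500, 80), (1500, 120)]:     # formant frequency / bandwidth in Hz
    r = np.exp(-np.pi * bw / fs)
    poles += [r * np.exp(1j * 2 * np.pi * freq / fs),
              r * np.exp(-1j * 2 * np.pi * freq / fs)]
a = np.real(np.poly(poles))      # denominator coefficients of H(z)

s = lfilter([1.0], a, e)         # s(n): output of the whole speech production model
print(s.shape)
```

Replacing the pulse train with white noise in the same filter would give the unvoiced counterpart; speaker-dependent information is carried mainly by the filter H(z).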
3.3 Speaker Identification

Human speech conveys different types of information: primarily the meaning, or words, that the speaker tries to pass to the listener, but also information about the language being spoken and about the emotions, gender and identity of the speaker. The goal of automatic speaker recognition is to extract, characterize and recognize the information about speaker identity. Speaker recognition is usually divided into two different branches, speaker verification and speaker identification. The speaker verification task is to verify the claimed identity of a person from his voice; this process involves only a binary decision about the claimed identity. In speaker identification there is no identity claim, and the system decides who the speaking person is. Speaker identification can be further divided into two branches: open-set speaker identification decides to which of the registered speakers an unknown speech sample belongs, or concludes that the speech sample comes from an unknown speaker. In this work we deal with closed-set speaker identification, which is a decision-making process that determines which of the registered speakers is most likely the author of the unknown speech sample. Depending on the algorithm used for the identification, the task can also be divided into text-dependent and text-independent identification: in the first case the system knows the text spoken by the person, while in the second case the system must be able to recognize the speaker from any text.

The process of speaker identification is divided into two main phases:
1) Speaker enrollment
2) Speaker identification

During the first phase, speaker enrollment, speech samples are collected from the speakers and used to train their models. The collection of enrolled models is also called a speaker database. In the enrollment phase, speaker-dependent features are extracted from these samples, modeled and stored in the speaker database. This process is represented in Figure 3.4.

Fig. 3.4 Enrollment Phase

The next phase is the identification phase, shown in Figure 3.5. In this second phase, a test sample from an unknown speaker is compared against the speaker database. Both phases include the same first step, feature extraction, which is used to extract speaker-dependent characteristics from the speech. The main purpose of this step is to reduce the amount of test data while retaining speaker-discriminative information.

Fig. 3.5 Identification Phase

These two phases are closely related; for instance, the identification algorithm usually depends on the modeling algorithm used in the enrollment phase.
This thesis mostly concentrates on the algorithms of the identification phase and their optimization.

CHAPTER 4
SPEECH FEATURE EXTRACTION

4.1 Introduction

The acoustic speech signal contains different kinds of information about the speaker. This includes "high-level" properties such as dialect, context, speaking style, the emotional state of the speaker, and many others. A great amount of work has already been done on identification algorithms based on the methods humans use to identify speakers, but these efforts are mostly impractical because of their complexity and the difficulty of measuring the speaker-discriminative properties used by humans. A more useful approach is based on the "low-level" properties of the speech signal, such as pitch (the fundamental frequency of the vocal cord vibrations), intensity, formant frequencies and their bandwidths, spectral correlations, the short-time spectrum, and others.

From the point of view of the automatic speaker recognition task, it is useful to think of the speech signal as a sequence of features that characterize both the speaker and the speech. An important step in the recognition process is to extract sufficient information for good discrimination, in a form and size that is amenable to effective modeling. The amount of data generated during speech production is quite large, while the essential characteristics of the speech process change relatively slowly and therefore require less data. Accordingly, feature extraction is a process of reducing the amount of data while retaining speaker-discriminative information.

When dealing with speech signals there are some criteria that the extracted features should meet. They should:
- discriminate between speakers while being tolerant of intra-speaker variability,
- be easy to measure,
- be stable over time,
- occur naturally and frequently in speech,
- change little from one speaking environment to another,
- not be susceptible to mimicry.

For MFCC feature extraction we use the melcepst function from the Voicebox toolbox. Because of its nature, the speech signal is a slowly varying, or quasi-stationary, signal. This means that when speech is examined over a sufficiently short period of time (20-30 milliseconds) it has quite stable acoustic characteristics. This leads to the useful concept of "short-term analysis", in which only a portion of the signal is used to extract features at any one time. It works in the following way: a window of predefined length (usually 20-30 milliseconds) is moved along the signal, with an overlap (usually 30-50% of the window length) between adjacent frames. The overlap is needed to avoid losing information. The parts of the signal formed in this way are called frames. In order to prevent an abrupt change at the end points of a frame, the frame is usually multiplied by a window function. The operation of dividing the signal into short intervals is called windowing, and such segments are called windowed frames (or sometimes just frames). There are several window functions used in the speaker recognition area, but the most popular is the Hamming window, which is described by the following equation:

W(n) = 0.54 − 0.46 cos(2πn / (N − 1)),   0 ≤ n ≤ N − 1

where N is the size of the window, or frame. A set of features extracted from one frame is called a feature vector.
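A small NumPy sketch of this framing and windowing step is shown below; the frame length, overlap and sampling rate are illustrative choices rather than values fixed by the project.

```python
import numpy as np

def frame_signal(signal, fs, frame_ms=25, overlap=0.5):
    """Split a 1-D signal into overlapping frames and apply a Hamming window."""
    frame_len = int(fs * frame_ms / 1000)            # e.g. 25 ms -> 200 samples at 8 kHz
    hop = int(frame_len * (1 - overlap))             # 50% overlap between adjacent frames
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))   # Hamming window W(n)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames                                     # shape: (n_frames, frame_len)

fs = 8000
t = np.arange(fs) / fs
speech_like = np.sin(2 * np.pi * 300 * t)             # stand-in for one second of speech
print(frame_signal(speech_like, fs).shape)            # -> (79, 200)
```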
4.2 Short Term Analysis

The speech signal varies slowly over time (it is quasi-stationary): when the signal is examined over a short period of time (5-100 ms), it is fairly stationary. Therefore speech signals are often analyzed in short time segments, which is referred to as short-time spectral analysis.

Fig. 4.1 Short Term Analysis (the signal is divided into overlapping segments: Frame 1, Frame 2, Frame 3, ..., Frame N)

4.3 Cepstrum

The speech signal s(n) can be represented as a "quickly varying" source signal e(n) convolved with the "slowly varying" impulse response h(n) of the vocal tract, represented as a linear filter. We have access only to the output (the speech signal), and it is often desirable to eliminate one of the components. Separating the source and the filter parameters from the mixed output is in general a difficult problem when these components are combined by a non-linear operation, but there are various techniques appropriate for components that are combined linearly. The cepstrum is a representation of the signal in which these two components are resolved into two additive parts. It is computed by taking the inverse DFT of the logarithm of the magnitude spectrum of the frame. This is represented by the following equation:

cepstrum(frame) = IDFT( log |DFT(frame)| )

Some explanation of the algorithm is needed. By moving to the frequency domain we change the convolution into a multiplication. By then taking the logarithm we change the multiplication into an addition, which is the desired separation into additive components. We can then apply the linear operator inverse DFT, knowing that the transform will operate individually on these two parts, and knowing what the Fourier transform will do with quickly varying and slowly varying parts: namely, it will put them into different, hopefully separate, regions of the new axis, called the quefrency axis.

4.3.1 Delta Cepstrum

The cepstral coefficients provide a good representation of the local spectral properties of the framed speech. However, it is well known that a large amount of information resides in the transitions from one segment of speech to another. An improved representation can be obtained by extending the analysis to include information about the temporal cepstral derivative. The delta cepstrum is used to capture the changes between different frames and is defined as:

ΔCs(n; m) = (1/2) [ Cs(n; m+1) − Cs(n; m−1) ],   n = 1, ..., Q

The result of the feature extraction is a series of vectors characteristic of the time-varying spectral properties of the speech signal.

Fig. 4.2 Speech Magnitude Spectrum

We can see that the speech magnitude spectrum is composed of slowly and quickly varying parts. But there is still one problem: multiplication is not a linear operation. We can solve this by taking the logarithm of the multiplication, as described earlier. Finally, let us look at the result of the inverse DFT in Figure 4.3.

Fig. 4.3 Cepstrum

We can see that the two components are clearly distinguishable now.
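The cepstrum computation described above takes only a few lines of NumPy. The example below is a toy illustration (not the project's code): a synthetic "voiced" frame is built by convolving a pulse train with a short filter, and its cepstrum separates the slowly varying envelope (low quefrencies) from the excitation (a peak near the pitch period).

```python
import numpy as np

def real_cepstrum(frame):
    """cepstrum(frame) = IDFT( log |DFT(frame)| )"""
    spectrum = np.fft.fft(frame)
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)   # small offset avoids log(0)
    return np.real(np.fft.ifft(log_magnitude))

# A toy "voiced" frame: periodic excitation convolved with a short vocal-tract response
excitation = np.zeros(400)
excitation[::80] = 1.0                                  # pulse train, period 80 samples
h = np.array([1.0, 0.8, 0.5, 0.3, 0.1])                 # slowly varying filter
frame = np.convolve(excitation, h)[:400]

c = real_cepstrum(frame * np.hamming(400))
# The first few samples of c describe the smooth (vocal-tract) part, while a peak
# near quefrency 80 samples reflects the periodic excitation.
print(np.argmax(c[40:120]) + 40)
```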
4.4 Mel-Frequency Cepstrum Coefficients

In this project we use Mel Frequency Cepstral Coefficients (MFCC). MFCC are coefficients that represent audio based on human perception. They have had great success in speaker recognition applications and are derived from the Fourier transform of the audio clip. In this technique the frequency bands are positioned logarithmically, whereas in the plain Fourier transform the frequency bands are spaced linearly. Because the frequency bands are positioned logarithmically in MFCC, it approximates the response of the human auditory system more closely than other representations, and these coefficients allow better processing of the data. The calculation of the mel cepstrum is the same as that of the real cepstrum, except that the mel cepstrum's frequency scale is warped to correspond to the mel scale.

The mel scale was proposed by Stevens, Volkmann and Newman in 1937. It is based on a study of pitch, or frequency, as perceived by human listeners, and its unit is the mel. In this test the listener started out hearing a frequency of 1000 Hz and labeled it 1000 mel for reference. The listeners were then asked to change the frequency until it was perceived as twice the reference pitch, and this frequency was labeled 2000 mel. The same procedure was repeated for half the perceived pitch, which was labeled 500 mel, and so on. On this basis the normal frequency axis is mapped onto the mel frequency axis. The mel scale is approximately a linear mapping below 1000 Hz and logarithmically spaced above 1000 Hz.

Mel-frequency cepstrum coefficients are well-known features used to describe a speech signal. They are based on the evidence that the information carried by the low-frequency components of the speech signal is phonetically more important for humans than that carried by the high-frequency components. The technique of computing MFCC is based on short-term analysis, and thus an MFCC vector is computed from each frame. MFCC extraction is similar to the cepstrum calculation except that one special step is inserted: the frequency axis is warped according to the mel scale. A "mel" is a unit of measure of the perceived pitch of a tone. It does not correspond linearly to normal frequency; indeed, it is approximately linear below 1 kHz and logarithmic above. The process of extracting MFCC from continuous speech is illustrated in Figure 4.4.

4.4.1 Computing of mel-cepstrum coefficients

Fig. 4.4 Computing of mel-cepstrum coefficients

The figure above shows the calculation of the mel cepstrum coefficients. A filter bank is used to perform the mel frequency warping, which is much more convenient than warping the frequency axis directly, with the filters centered according to mel frequency. The width of the triangular filters varies according to the mel frequency, so that the log total energy in a critical band around each center frequency is included. The result of the warping is a vector of filter-bank log energies. Finally, the inverse DFT (of the same length N as the DFT used in the cepstrum section) is applied to these coefficients, transforming them from the frequency domain to the quefrency (cepstral) domain.

To place more emphasis on the low frequencies, one special step is inserted before the inverse DFT in the calculation of the cepstrum, namely mel scaling. This approach is based on psychophysical studies of human perception of the frequency content of sounds. One useful way to create the mel spectrum is to use a filter bank, with one filter for each desired mel-frequency component. Every filter in this bank has a triangular band-pass frequency response. Such filters compute the average spectrum around each center frequency, with increasing bandwidths, as displayed in Figure 4.5.

Fig. 4.5 Triangular Filter used to compute mel-cepstrum

This filter bank is applied in the frequency domain and therefore simply amounts to applying the triangular filters to the spectrum. In practice the last step of taking the inverse DFT is replaced by the discrete cosine transform (DCT) for computational efficiency. The number of resulting mel-frequency cepstrum coefficients is in practice chosen relatively low, in the order of 12 to 20 coefficients. The zeroth coefficient is usually dropped because it represents the average log-energy of the frame and carries only a little speaker-specific information. The MFCC are not all equally important for speaker identification, so a weighting of the coefficients may be applied to obtain more precise results. A different approach to the computation of MFCC, simplified by omitting the filter-bank analysis, has also been described in the literature.
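In the project itself the MFCC are computed with the melcepst function of the MATLAB Voicebox toolbox. For illustration only, the Python sketch below reproduces the same chain of steps (magnitude spectrum → triangular mel filter bank → logarithm → DCT); the filter-bank construction is simplified and all parameter values are merely examples.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):   return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):   return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced evenly on the mel scale (simplified)."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)   # rising slope
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1) # falling slope
    return fb

def mfcc(frame, fs, n_filters=26, n_coeffs=13):
    """MFCC of one windowed frame: |FFT| -> mel filter bank -> log -> DCT."""
    spectrum = np.abs(np.fft.rfft(frame))                 # magnitude spectrum
    energies = mel_filterbank(n_filters, len(frame), fs) @ spectrum
    return dct(np.log(energies + 1e-12), norm='ortho')[:n_coeffs]

fs = 8000
frame = np.hamming(256) * np.random.default_rng(0).normal(size=256)
print(mfcc(frame, fs).shape)    # -> (13,) one feature vector per frame
```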
4.5 Framing and Windowing

The speech signal varies slowly over time (it is quasi-stationary): when the signal is examined over a short period of time (5-100 ms), it is fairly stationary. Therefore speech signals are often analyzed in short time segments, which is referred to as short-time spectral analysis. In practice this means that the signal is blocked into frames of typically 20-30 ms. Adjacent frames typically overlap each other by 30-50%; this is done in order not to lose any information due to the windowing.

Fig. 4.6 Quasi Stationary

After the signal has been framed, each frame is multiplied by a window function w(n) of length N, where N is the length of the frame. Typically the Hamming window is used:

W(n) = 0.54 − 0.46 cos(2πn / (N − 1)),   0 ≤ n ≤ N − 1

The windowing is done to avoid problems due to truncation of the signal, as windowing helps to smooth the signal.

CHAPTER 5
VECTOR QUANTIZATION

A speaker recognition system must be able to estimate probability distributions of the computed feature vectors. Storing every single vector generated in the training mode is impossible, since these distributions are defined over a high-dimensional space. It is therefore often easier to start by quantizing each feature vector to one of a relatively small number of template vectors, using a process called vector quantization (VQ). VQ is a process of taking a large set of feature vectors and producing a smaller set of vectors that represent the centroids of the distribution. The technique consists of extracting a small number of representative feature vectors as an efficient means of characterizing the speaker-specific features, and it removes the need to store every single vector generated during training.

Vector quantization is a process of mapping vectors from a vector space to a finite number of regions in that space. These regions are called clusters and are represented by their central vectors, or centroids. A set of centroids that represents the whole vector space is called a codebook. In speaker identification, VQ is applied to the set of feature vectors extracted from the speech sample, and as a result the speaker codebook is generated. Such a codebook has a significantly smaller size than the extracted vector set and is referred to as the speaker model.

Actually, there is some disagreement in the literature about the approach used in VQ.
Some authors consider it a template matching approach, because VQ ignores all temporal variations and simply uses global averages (centroids). Other authors consider it a stochastic, or probabilistic, method, because VQ uses centroids to estimate the modes of a probability distribution. Theoretically it is possible that every cluster, defined by its centroid, models a particular component of the speech. In practice, however, VQ creates unrealistic clusters with rigid boundaries, in the sense that every vector belongs to one and only one cluster.

Mathematically, a VQ task is defined as follows: given a set of feature vectors, find a partitioning of the feature vector space into a predefined number of regions that do not overlap with each other and that together form the whole feature vector space. Every vector inside such a region is represented by the corresponding centroid. The process of VQ for two speakers is represented in Figure 5.1.

Fig. 5.1 Vector Quantization of Two Speakers

There are two important design issues in VQ: the method for generating the codebook, and the codebook size. Known clustering algorithms for codebook generation are:
- Generalized Lloyd algorithm (GLA)
- Self-organizing maps (SOM)
- Pairwise nearest neighbor (PNN)
- Iterative splitting technique (SPLIT)
- Randomized local search (RLS)
- K-means

According to the literature, the iterative splitting technique should be used when running time is important, but RLS is simpler to implement and generates better codebooks for the speaker identification task. The codebook size is a trade-off between running time and identification accuracy: with a large codebook, identification accuracy is high but at the cost of running time, and vice versa. Reported experimental results indicate that accuracy saturates at a codebook size of about 64 vectors. The quantization distortion (the quality of the quantization) is usually computed as the sum of squared distances between each vector and its representative (centroid). Well-known distance measures are the Euclidean and city-block distances.

CHAPTER 6
K-MEANS

The K-means algorithm partitions the T feature vectors into M centroids. The algorithm first chooses M cluster centroids among the T feature vectors. Each feature vector is then assigned to the nearest centroid, and the new centroids are calculated. This procedure is continued until a stopping criterion is met, that is, until the mean squared error between the feature vectors and the cluster centroids falls below a certain threshold or there is no more change in the cluster-center assignment.

In data mining, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into cells. The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and that converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions, in that both take an iterative refinement approach. Additionally, both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes. The most common algorithm uses an iterative refinement technique.
Due to its ubiquity it is often called the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly in the computer science community. Given an initial set of k means m1(1), ..., mk(1) (see the initialization methods below), the algorithm proceeds by alternating between two steps:

Assignment step: assign each observation to the cluster whose mean is nearest, i.e.

Si(t) = { xp : ‖xp − mi(t)‖² ≤ ‖xp − mj(t)‖² for all 1 ≤ j ≤ k }

where each xp goes into exactly one Si(t), even if it could go into two of them.

Update step: calculate the new means to be the centroids of the observations in the new clusters:

mi(t+1) = (1 / |Si(t)|) Σ xj,   the sum taken over all xj in Si(t)

The algorithm is deemed to have converged when the assignments no longer change. Commonly used initialization methods are Forgy and Random Partition. The Forgy method randomly chooses k observations from the data set and uses these as the initial means. The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial means as the centroids of the clusters' randomly assigned points. The Forgy method tends to spread the initial means out, while Random Partition places all of them close to the center of the data set. According to Hamerly et al., the Random Partition method is generally preferable for algorithms such as k-harmonic means and fuzzy k-means, while for expectation maximization and the standard k-means algorithm the Forgy method of initialization is preferable.

Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n), S = {S1, S2, ..., Sk}, so as to minimize the within-cluster sum of squares (WCSS):

arg min over S of  Σ(i = 1..k) Σ(xj in Si) ‖xj − μi‖²

where μi is the mean of the points in Si.

The two key features of k-means that make it efficient are often regarded as its biggest drawbacks: Euclidean distance is used as the metric, and variance is used as the measure of cluster scatter. The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks to determine the number of clusters in the data set. A key limitation of k-means is its cluster model. The concept is based on spherical clusters that are separable in such a way that the mean value converges towards the cluster center. The clusters are expected to be of similar size, so that the assignment to the nearest cluster center is the correct assignment. When, for example, k-means with k = 3 is applied to the well-known Iris flower data set, the result often fails to separate the three Iris species contained in the data set. With k = 2 the two visible clusters (one containing two species) will be discovered, whereas with k = 3 one of the two clusters will be split into two even parts. In fact, k = 2 is more appropriate for this data set, despite the data set containing three classes. As with any other clustering algorithm, the k-means result relies on the data set satisfying the assumptions made by the clustering algorithm: it works very well on some data sets, while failing miserably on others.

The result of k-means can also be seen as the Voronoi cells of the cluster means. Since data are split halfway between cluster means, this can lead to suboptimal splits, as can be seen in the "mouse" example. The Gaussian models used by the expectation-maximization algorithm (which can be seen as a generalization of k-means) are more flexible here, by having both variances and covariances. The EM result is thus able to accommodate clusters of variable size much better than k-means, as well as correlated clusters.
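A compact NumPy implementation of Lloyd's iteration as described above, using Forgy initialization, is sketched below on toy two-dimensional data; it is written purely for illustration and is not the code used in the project.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and update steps until stable."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]     # Forgy initialization
    for _ in range(n_iter):
        # Assignment step: each observation goes to the cluster with the nearest mean
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: new means are the centroids of the assigned observations
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else means[i] for i in range(k)])
        if np.allclose(new_means, means):                    # assignments have stabilized
            break
        means = new_means
    return means, labels

# Toy data: two Gaussian blobs standing in for the feature vectors of one speaker
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```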
k-means clustering, in particular when using heuristics such as Lloyd's algorithm, is rather easy to implement and can be applied even to large data sets. As such, it has been successfully used in various fields, ranging from market segmentation, computer vision, geostatistics and astronomy to agriculture. It is often used as a preprocessing step for other algorithms, for example to find a starting configuration. k-means clustering, with its associated expectation-maximization algorithm, is a special case of a Gaussian mixture model, specifically the limit of taking all covariances as diagonal, equal, and small. It is often easy to generalize a k-means problem into a Gaussian mixture model.

6.1 Mean shift clustering

Basic mean shift clustering algorithms maintain a set of data points the same size as the input data set. Initially, this set is copied from the input set. Then this set is iteratively replaced by the mean of those points in the set that are within a given distance of each point. By contrast, k-means restricts this updated set to k points, usually far fewer than the number of points in the input data set, and replaces each point in this set by the mean of all points in the input set that are closer to that point than to any other (e.g. within the Voronoi partition of each updating point). A mean shift algorithm that is similar to k-means, called likelihood mean shift, replaces the set of points undergoing replacement by the mean of all points in the input set that are within a given distance of the changing set. One of the advantages of mean shift over k-means is that there is no need to choose the number of clusters, because mean shift is likely to find only a few clusters if only a small number indeed exist. However, mean shift can be much slower than k-means. Mean shift has soft variants, much as k-means does.

6.2 Bilateral filtering

k-means implicitly assumes that the ordering of the input data set does not matter. The bilateral filter is similar to k-means and mean shift in that it maintains a set of data points that are iteratively replaced by means. However, the bilateral filter restricts the calculation of the (kernel-weighted) mean to include only points that are close in the ordering of the input data. This makes it applicable to problems such as image denoising, where the spatial arrangement of pixels in an image is of critical importance.

6.3 Speaker Matching

During matching, a matching score is computed between the extracted feature vectors and every speaker codebook enrolled in the system. Commonly this is done by partitioning the extracted feature vectors using the centroids from the speaker codebook and taking the quantization distortion as the matching score. Another choice for the matching score is the mean squared error (MSE), which is computed as the sum of the squared distances between each vector and the nearest centroid, divided by the number of vectors extracted from the speech sample. The quantization distortion (the quality of the quantization) is usually computed as the sum of squared distances between each vector and its representative (centroid). Well-known distance measures are the Euclidean, city-block, weighted Euclidean and Mahalanobis distances. They are given by the following equations:

Euclidean:            d(x, y) = √( Σi (xi − yi)² )
City block:           d(x, y) = Σi |xi − yi|
Weighted Euclidean:   d(x, y) = √( (x − y)ᵀ D⁻¹ (x − y) )

where x and y are multi-dimensional feature vectors and D is a weighting matrix. When D is the covariance matrix, the weighted Euclidean distance is also called the Mahalanobis distance.
A weighted Euclidean distance in which D is a diagonal matrix consisting of the diagonal elements of the covariance matrix is often more appropriate, in the sense that it provides more accurate identification results. The reason is that, by their nature, not all components of the feature vectors are equally important, and a weighted distance may therefore give a more precise result.

CHAPTER 7
EUCLIDEAN DISTANCE

In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and it is given by the Pythagorean formula. By using this formula as the distance, Euclidean space (or indeed any inner product space) becomes a metric space. The associated norm is called the Euclidean norm. Older literature refers to the metric as the Pythagorean metric.

The Euclidean distance between points p and q is the length of the line segment connecting them. In Cartesian coordinates, if p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) are two points in Euclidean n-space, then the distance from p to q, or from q to p, is given by:

d(p, q) = d(q, p) = √( (q1 − p1)² + (q2 − p2)² + ... + (qn − pn)² )        (1)

The position of a point in Euclidean n-space is a Euclidean vector, so p and q are Euclidean vectors, starting from the origin of the space, whose tips indicate two points. The Euclidean norm, or Euclidean length, or magnitude of a vector measures the length of the vector:

‖p‖ = √( p1² + p2² + ... + pn² ) = √( p · p )

where the last expression involves the dot product. A vector can be described as a directed line segment from the origin of the Euclidean space (the vector's tail) to a point in that space (the vector's tip). If we consider its length to be the distance from its tail to its tip, it becomes clear that the Euclidean norm of a vector is just a special case of the Euclidean distance: the Euclidean distance between its tail and its tip.

The distance between points p and q may have a direction (e.g. from p to q), so it may be represented by another vector, given by:

q − p = (q1 − p1, q2 − p2, ..., qn − pn)

In a three-dimensional space (n = 3) this is an arrow from p to q, which can also be regarded as the position of q relative to p. It may also be called a displacement vector if p and q represent two positions of the same point at two successive instants of time. The Euclidean distance between p and q is just the Euclidean length of this distance (or displacement) vector:

d(p, q) = ‖q − p‖

which is equivalent to equation (1), and also to:

d(p, q) = √( ‖p‖² + ‖q‖² − 2 p · q )        (2)

7.1 One dimension

In one dimension, the distance between two points on the real line is the absolute value of their numerical difference. Thus if x and y are two points on the real line, the distance between them is given by:

d(x, y) = √( (x − y)² ) = |x − y|

In one dimension there is a single homogeneous, translation-invariant metric (in other words, a distance that is induced by a norm), up to a scale factor of length, and it is the Euclidean distance. In higher dimensions there are other possible norms.

7.2 Two dimensions

In the Euclidean plane, if p = (p1, p2) and q = (q1, q2), then the distance is given by

d(p, q) = √( (q1 − p1)² + (q2 − p2)² )

This is equivalent to the Pythagorean theorem. Alternatively, it follows from (2) that if the polar coordinates of the point p are (r1, θ1) and those of q are (r2, θ2), then the distance between the points is

d(p, q) = √( r1² + r2² − 2 r1 r2 cos(θ1 − θ2) )

7.3 Three dimensions

In three-dimensional Euclidean space, the distance is

d(p, q) = √( (q1 − p1)² + (q2 − p2)² + (q3 − p3)² )

7.4 N dimensions

In general, for an n-dimensional space, the distance is

d(p, q) = √( (q1 − p1)² + (q2 − p2)² + ... + (qn − pn)² )
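For illustration (not project code), the same formula can be evaluated directly with NumPy for any number of dimensions:

```python
import numpy as np

def euclidean(p, q):
    """d(p, q) = sqrt(sum_i (q_i - p_i)^2), valid in any number of dimensions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(((q - p) ** 2).sum())

print(euclidean(1.0, 4.0))                  # 3.0  (one dimension: |x - y|)
print(euclidean([0, 0], [3, 4]))            # 5.0  (two dimensions: 3-4-5 triangle)
print(euclidean([1, 2, 3], [4, 6, 15]))     # 13.0 (three dimensions)
print(np.linalg.norm(np.array([3, 4])))     # 5.0  (Euclidean norm = distance from origin)
```

In the identification system this is exactly the distance used to compare each test feature vector with the codebook centroids when computing the matching score of Section 6.3.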
7.5 Squared Euclidean Distance

The standard Euclidean distance can be squared in order to place progressively greater weight on objects that are farther apart. In this case the equation becomes

d²(p, q) = (q1 − p1)² + (q2 − p2)² + ... + (qn − pn)²

Squared Euclidean distance is not a metric, as it does not satisfy the triangle inequality; however, it is frequently used in optimization problems in which distances only have to be compared. It is also referred to as the quadrance within the field of rational trigonometry.

7.6 Decision making

The next step after computing the matching scores for every speaker model enrolled in the system is assigning the classification label to the input speech. This process depends on the selected matching and modeling algorithms. In template matching, the decision is based on the computed distances, whereas in stochastic matching it is based on the computed probabilities.

Fig. 7.1 Decision Process

In the recognition phase an unknown speaker, represented by a sequence of feature vectors, is compared with the codebooks in the database. For each codebook a distortion measure is computed, and the speaker with the lowest distortion is chosen as the most likely speaker.

CHAPTER 8
RESULT AND CONCLUSION

8.1 Result

In this chapter we show the results of our experiments. Every chart is preceded by a short explanation of what we measured and why, and is followed by a short discussion of the results.

8.2 Pruning Basis

First, we start from the basis for speaker pruning. We ran a few tests to see how the matching function behaves during the identification and when it stabilizes. The chart in Figure 8.1 shows the variation of the matching score as a function of the number of available test vectors for 20 different speakers. One vector corresponds to one analysis frame.

Fig. 8.1 Distance stabilization

In this figure, the bold line represents the owner of the test sample. It can be seen that at the beginning the matching score for the correct speaker lies somewhere among the other scores, but after a large enough number of new vectors has been extracted from the test speech it comes close to only a few other scores, and at the end it becomes the smallest score. This is the underlying reason for speaker pruning: when we have more data, we can drop some of the models from the identification.

8.2.1 Static pruning

In the next experiment we consider the trade-off between identification error rate and average time spent on the identification for static pruning. By varying the pruning interval or the number of pruned speakers we expect different error rates and different identification times. From several runs with different parameter combinations we can plot the error rate as a function of the average identification time. To obtain this dependency, we fixed three values of the pruning interval and varied the number of pruned speakers. The results are shown in Figure 8.2.

Fig. 8.2 Evaluation of static variant using different pruning intervals

From this figure we can see that all the curves follow almost the same shape. This is because, in order to have fast identification, we have to choose either a small pruning interval or a large number of pruned speakers; on the other hand, in order to have a low error rate we have to choose a large interval or a small number of pruned speakers. The main conclusion from this figure is that these two parameters compensate for each other.

8.2.2 Adaptive pruning

The idea of adaptive pruning is based on the assumption that the distribution of matching scores more or less follows a Gaussian curve. In Figure 8.3 we can see the distributions of matching scores for two typical identifications.
Fig. 8.3 Examples of the matching score distribution

From this figure we can see that the distribution does not follow the Gaussian curve exactly, but its shape is almost the same. In the next experiment we consider the trade-off between identification error rate and average time spent on the identification for adaptive pruning. By fixing the parameter η and varying the pruning interval we obtained the desired dependency.

8.3 Conclusion

The goal of this project was to create a speaker recognition system and apply it to the speech of an unknown speaker, by investigating the extracted features of the unknown speech and comparing them with the stored features of each enrolled speaker in order to identify the unknown speaker.

The feature extraction is done using MFCC (Mel Frequency Cepstral Coefficients); the function 'melcepst' is used to calculate the mel cepstrum of a signal. Each speaker is modeled using Vector Quantization (VQ): a VQ codebook is generated by clustering the training feature vectors of each speaker and is then stored in the speaker database. In this method, the K-means algorithm is used to do the clustering. In the recognition stage, a distortion measure based on minimizing the Euclidean distance is used when matching an unknown speaker against the speaker database. During this project we have found that the VQ-based clustering approach provides a fast speaker identification process.