Digital Signal Processing (Term Project)
by Habib ur Rehman Abdul Basit
CENTER FOR ADVANCED STUDIES IN ENGINEERING

Speaker Recognition

Introduction: What is Speaker Recognition?
• A process that automatically recognizes who is speaking on the basis of individual information included in the speech waves.
[Figure: Speech Signal → Speaker Recognition System; labels: Words, "Who are you?"]

Goals
The goal of this project is to build a simple, yet complete and representative, speaker recognition system.
• The system should be able to identify speakers based on the different voice characteristics of each of the known speakers.
• This identification should be accomplished regardless of the sentence spoken (text-independent).

Basic Structure of Speaker Recognition System
[Figure: Speaker Identification / Speaker Verification]

Principle of Speaker Recognition System
All speaker recognition systems have to serve two distinct phases:
• Enrollment or training phase
• Testing phase
In the training phase, each registered speaker provides samples of their speech so that the system can build a reference model for that speaker.
In the testing phase, the input speech is matched against the stored reference model(s) and a recognition decision is made.

Basic Structure of Speaker Recognition System
[Figure: Feature Extraction / Feature Matching]

MFCC Processor
[Block diagram]

Frame Blocking
• The continuous signal is blocked into frames of N samples.

Windowing
• Windowing the frames minimizes the signal discontinuities at the beginning and end of each frame.
• Windowing minimizes spectral distortion by tapering the signal to zero at the beginning and end of each frame.
• $y[n] = x[n]\,w[n], \quad 0 \le n \le N-1$
• Typically a Hamming window is used, which has the form $w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$
• The 1st frame consists of N samples.
• The 2nd frame begins M samples after the 1st and overlaps it by N − M samples, and so on.
• Typically N = 256 (radix-2 FFT) and M = 100.

Fourier Transform (FFT)
• $X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1$
[Block diagram: Frame Blocking → Windowing → FFT (spectrum) → Mel-frequency Wrapping (mel spectrum) → Cepstrum (mel cepstrum)]

Cosine Transform (Mel Cepstrum)
• $\tilde{c}_n = \sum_{k=1}^{K} \left(\log \tilde{S}_k\right) \cos\left[ n \left(k - \frac{1}{2}\right) \frac{\pi}{K} \right], \quad n = 1, 2, 3, \ldots, K$, where $\tilde{S}_k$ are the mel spectrum coefficients.

Speech Production: A Convolution Process
• Speech can be modeled as a convolution between a glottal excitation source g[n] and a vocal tract impulse response v[n]:
• y[n] = g[n] * v[n]

Cepstrum: A Transformation
• It is believed that vocal tract characteristics are important to speech and speaker recognition.
• We would like to separate out this filtered response.
• The cepstrum does this: it converts the multiplication (convolution in time) $Y(\omega) = g(\omega)\,v(\omega)$ into the sum $\tilde{Y}(\omega) = \log[g(\omega)] + \log[v(\omega)]$.

Mel Cepstrum: Mimicking the behaviour of the human ear

Mel Filter Bank
• Linear spacing below 1 kHz, logarithmic scale above 1 kHz.
• Triangular filters emphasize the center frequency and span to the next center frequency.
• Thus, for each tone with an actual frequency f in Hz, a subjective pitch is measured on the mel scale: $\mathrm{mel}(f) = 2595 \log_{10}(1 + f/700)$ (Fant's expression)

Part 2: Speaker Verification

Speaker Verification: Feature Matching
• Feature matching is the classification of objects of interest; here the objects are patterns of acoustic vectors extracted from the input speech.
• Since the classification is applied to extracted features, the process can also be referred to as feature matching.
• Various feature matching techniques exist: DTW, HMM, VQ, etc.
• Vector Quantization is a process of mapping vectors from a large vector space to a small number of regions in that space.
• Each region is called a cluster and is represented by its center, called a 'codeword'.
• The collection of all the 'codewords' is called a codebook.

Vector Quantization: The Codebook

Vector Quantization (The LBG algorithm)
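
The slides name the LBG algorithm but do not spell out its steps, so the following is only a minimal sketch of the usual binary-splitting formulation, not the project's own implementation. The function names (`lbg`, `distortion`, `identify`), the splitting parameter `eps`, the stopping tolerance `tol`, and the synthetic 2-D "acoustic vectors" are all assumptions made for illustration. Picking the speaker whose codebook gives the lowest average distortion is likewise an assumed decision rule.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def nearest(codebook, v):
    """Index of the codeword closest to v."""
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], v))


def distortion(codebook, vectors):
    """Average squared distance from each vector to its nearest codeword."""
    return sum(sq_dist(codebook[nearest(codebook, v)], v) for v in vectors) / len(vectors)


def lbg(vectors, size, eps=0.01, tol=1e-3):
    """Train a codebook of `size` codewords (a power of two) by binary splitting."""
    codebook = [centroid(vectors)]  # start with a single codeword
    while len(codebook) < size:
        # Split every codeword into two slightly perturbed copies.
        codebook = [[c * (1 + s * eps) for c in cw] for cw in codebook for s in (1, -1)]
        prev = float("inf")
        while True:
            # Assign each vector to its nearest codeword (its cluster).
            clusters = [[] for _ in codebook]
            for v in vectors:
                clusters[nearest(codebook, v)].append(v)
            # Move each codeword to the center of its cluster (keep it if the cluster is empty).
            codebook = [centroid(c) if c else cw for c, cw in zip(clusters, codebook)]
            d = distortion(codebook, vectors)
            if prev - d <= tol * d:  # improvement too small: stop refining
                break
            prev = d
    return codebook


def identify(codebooks, vectors):
    """Return the name of the speaker whose codebook fits the vectors best."""
    return min(codebooks, key=lambda name: distortion(codebooks[name], vectors))


# Hypothetical demo: two "speakers" whose 2-D feature vectors cluster in different places.
random.seed(0)


def cloud(cx, cy, n=50):
    return [[random.gauss(cx, 0.3), random.gauss(cy, 0.3)] for _ in range(n)]


train = {"A": cloud(1, 1) + cloud(3, 1), "B": cloud(-2, 4) + cloud(-4, 2)}
codebooks = {name: lbg(vecs, size=4) for name, vecs in train.items()}
print(identify(codebooks, cloud(1, 1, 20)))
```

In a real system the training vectors would be the mel cepstrum coefficients of each enrolled speaker rather than synthetic points, and the codebook size would be chosen experimentally.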
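
The acoustic vectors used for feature matching come from the MFCC processor described earlier (frame blocking, windowing, FFT, mel-frequency wrapping, cosine transform). The sketch below strings those stages together with the slide's values N = 256 and M = 100. It is an illustration under stated assumptions: the number of filters `K = 20`, the 8 kHz sampling rate, the use of the power spectrum, and the construction of the filter bank by equal spacing on the mel formula are not given in the slides. A direct DFT is used instead of a radix-2 FFT purely to keep the example dependency-free.

```python
import cmath
import math


def hz_to_mel(f):
    """Mel scale from the slides: mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def frame_blocks(signal, N=256, M=100):
    """Frames of N samples, each starting M samples after the previous one."""
    return [signal[s:s + N] for s in range(0, len(signal) - N + 1, M)]


def hamming(N):
    """w[n] = 0.54 - 0.46 cos(2 pi n / (N - 1)), 0 <= n <= N - 1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]


def power_spectrum(frame):
    """|X[k]|^2 for k = 0..N/2, computed with a direct (slow) DFT."""
    N = len(frame)
    out = []
    for k in range(N // 2 + 1):
        X = sum(x * cmath.exp(-2j * math.pi * k * n / N) for n, x in enumerate(frame))
        out.append(abs(X) ** 2)
    return out


def mel_filterbank(K, N, fs):
    """K triangular filters whose centers are equally spaced on the mel scale (assumption)."""
    mel_pts = [hz_to_mel(fs / 2.0) * i / (K + 1) for i in range(K + 2)]
    bins = [int(math.floor((N + 1) * mel_to_hz(m) / fs)) for m in mel_pts]
    bank = []
    for i in range(1, K + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        filt = [0.0] * (N // 2 + 1)
        for b in range(lo, mid):   # rising edge up to the center frequency
            filt[b] = (b - lo) / (mid - lo)
        for b in range(mid, hi):   # falling edge down to the next center
            filt[b] = (hi - b) / (hi - mid)
        bank.append(filt)
    return bank


def mfcc(signal, fs, N=256, M=100, K=20):
    """One vector of K mel cepstrum coefficients per frame."""
    window = hamming(N)
    bank = mel_filterbank(K, N, fs)
    coeffs = []
    for frame in frame_blocks(signal, N, M):
        windowed = [x * w for x, w in zip(frame, window)]       # y[n] = x[n] w[n]
        spec = power_spectrum(windowed)
        S = [sum(f * p for f, p in zip(filt, spec)) for filt in bank]
        log_s = [math.log(max(s, 1e-12)) for s in S]           # floor avoids log(0)
        coeffs.append([
            sum(log_s[k - 1] * math.cos(n * (k - 0.5) * math.pi / K) for k in range(1, K + 1))
            for n in range(1, K + 1)
        ])
    return coeffs


# Hypothetical input: a 440 Hz tone sampled at 8 kHz stands in for recorded speech.
fs = 8000
tone = [math.sin(2.0 * math.pi * 440.0 * t / fs) for t in range(1000)]
vectors = mfcc(tone, fs)
print(len(vectors), "frames of", len(vectors[0]), "coefficients")
```

Each inner list is one acoustic vector. A sequence of such vectors per speaker is what a vector quantizer would cluster into a codebook.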