International Journal of Engineering Trends and Technology (IJETT) – Volume 12 Number 7 - Jun 2014

Isolated Digit Recognition Using in Ear Microphone Data Using MFCC, VQ and HMM

Mahesh K. Patil¹, Prof. Dr. (Mrs.) L.S. Admuthe², Prashant P. Zirmite³
¹,³ PG student, Department of Electronics Engineering, Textile & Engineering Institute, Ichalkaranji
² Head, Department of Electronics Engineering, Textile & Engineering Institute, Ichalkaranji

Abstract – This paper implements an isolated digit recognizer in three steps. The first step performs endpoint detection and speech segmentation by short-term analysis. The second step extracts speech features using Mel Frequency Cepstral Coefficients (MFCC). The third step uses a Vector Quantization (VQ) and Hidden Markov Model (HMM) based classifier for isolated digit recognition. An English spoken digit database containing a number of speakers is used for the testing and validation modules.

Keywords – Speaker recognition, MFCC, Vector Quantization, Hidden Markov Model.

I. INTRODUCTION

Speech is the natural and fundamental means of communication for humans. Technically speaking, speech recognition refers to a mechanism that stores some representation of the distinguishing characteristics of speech captured by an input device such as a microphone, and then matches incoming speech against these representations in order to interact with machines, computers and/or human users.

In an isolated digit recognition system, the digits are spoken in isolation with a distinct pause between them. This is the simplest form of recognition strategy, since an endpoint detector can easily be used to locate the boundaries of the digits. The pauses between digits are typically on the order of 200 ms, so that they are not confused with weak fricatives and gaps within digits. Furthermore, the user must be cooperative for the isolated digit recognition system to be successful.
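As an illustration of how an endpoint detector can locate digit boundaries by short-term analysis, the following minimal sketch marks a frame as speech when its short-term energy exceeds a fraction of the peak frame energy. The function names, the frame length of 200 samples, the hop of 80 samples and the 10% threshold are assumptions chosen for illustration, not values taken from this paper. The zero crossing rate is computed alongside the energy, but this simple decision rule uses the energy only.

```python
def short_term_features(signal, frame_len, hop):
    """Return a list of (energy, zero_crossing_rate) pairs, one per frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Mean squared amplitude of the frame.
        energy = sum(s * s for s in frame) / frame_len
        # Count sign changes between neighbouring samples.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        feats.append((energy, crossings / (frame_len - 1)))
    return feats


def detect_endpoints(signal, frame_len=200, hop=80, energy_ratio=0.1):
    """Return (start_sample, end_sample) of the speech region, or None.

    A frame counts as speech when its energy exceeds energy_ratio times
    the largest frame energy (an illustrative, assumed threshold rule).
    """
    feats = short_term_features(signal, frame_len, hop)
    if not feats:
        return None
    peak = max(energy for energy, _ in feats)
    if peak == 0.0:
        return None  # pure silence
    active = [i for i, (energy, _) in enumerate(feats)
              if energy > energy_ratio * peak]
    first, last = active[0], active[-1]
    return first * hop, last * hop + frame_len
```

The returned sample range can be used to crop the utterance before feature extraction. A practical detector would normally also use the zero crossing rate to recover weak fricatives at the word edges.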
There are two major stages in isolated word recognition: a training stage and a testing stage. The training stage teaches the system by building its dictionary (database), i.e. an acoustic model for each word that the system needs to recognize. The testing stage matches the input test speech signal against the reference database and recognizes the word. During the testing phase, the same kind of feature vectors are extracted from the input test speech signal, and the degree of their match with the references is obtained using some matching technique.

ISSN: 2231-5381, http://www.ijettjournal.org

II. BLOCK DIAGRAM

[Fig. 1 System block diagram. Training stage: reference speech samples → feature extraction (MFCC) → trained database. Testing stage: input from user → feature extraction (MFCC) → feature matching using VQ → HMM classification.]

Fig. 1 shows the proposed flow diagram of the system. In the training stage, several samples of each word are taken from a speaker, and this is repeated for every word in the vocabulary. Feature vectors (sets of MFCCs) are calculated for these reference samples. For each word, a Hidden Markov Model is built from the observed feature vectors. In the testing stage, real-time input can be given to the system. Its feature vectors are matched with the feature vectors from the database using vector quantization. The word is then classified using the Hidden Markov Model as the recognition algorithm.

III. FEATURE EXTRACTION

Fig. 2 shows the block diagram of MFCC computation. MFCC is the most prominent example of a feature set that is extensively used in speech recognition. Because its frequency bands are positioned logarithmically, it approximates the response of the human auditory system more closely than linearly spaced frequency bands do. MFCCs are computed by short-term analysis: one MFCC vector is computed from each frame.
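The short-term MFCC computation outlined in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation. It uses only the Python standard library, so the DFT is computed directly and is slow; a practical system would use an FFT. The frame length N = 256 and frame shift M = 100 follow the typical values quoted in this paper. The sampling rate of 8 kHz, the pre-emphasis coefficient 0.95, the 20 Mel filters and the 12 cepstral coefficients are assumptions. The final DCT plays the role of the inverse transform that yields the Mel cepstrum.

```python
import cmath
import math


def hz_to_mel(f):
    """Mel value for a frequency f in Hz."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def pre_emphasis(x, alpha=0.95):
    """One-coefficient FIR filter y[n] = x[n] - alpha * x[n-1] (alpha assumed)."""
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def split_frames(x, n=256, m=100):
    """Block x into frames of n samples whose starts are m samples apart."""
    return [x[s:s + n] for s in range(0, len(x) - n + 1, m)]


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def power_spectrum(frame):
    """|X[k]|^2 for k = 0..N/2, using a direct (slow) DFT."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        xk = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        spec.append(abs(xk) ** 2)
    return spec


def mel_filterbank(num_filters, n, fs):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    top = hz_to_mel(fs / 2.0)
    edges = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [int(math.floor((n + 1) * f / fs)) for f in edges]
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n // 2 + 1)
        for k in range(lo, mid):      # rising slope
            filt[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):      # falling slope
            filt[k] = (hi - k) / (hi - mid)
        bank.append(filt)
    return bank


def mfcc(signal, fs=8000, n=256, m=100, num_filters=20, num_ceps=12):
    """Return one list of num_ceps cepstral coefficients per frame."""
    window = hamming(n)
    bank = mel_filterbank(num_filters, n, fs)
    feats = []
    for frame in split_frames(pre_emphasis(signal), n, m):
        spec = power_spectrum([s * w for s, w in zip(frame, window)])
        # Log energy in each Mel band (small constant avoids log(0)).
        log_e = [math.log(sum(f * p for f, p in zip(filt, spec)) + 1e-10)
                 for filt in bank]
        # DCT-II of the log energies; coefficient 0 is skipped.
        feats.append([
            sum(log_e[j] * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j in range(num_filters))
            for c in range(1, num_ceps + 1)
        ])
    return feats
```

Calling `mfcc(samples)` on a list of speech samples returns one 12-dimensional vector per frame. These vectors are the acoustic vectors that the later matching and classification stages operate on.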
In order to extract the coefficients, the speech sample is taken as the input and a Hamming window is applied to each frame to minimize discontinuities at the frame edges. The DFT of each windowed frame is then computed and the resulting spectrum is passed through the Mel filter bank (Mel frequency warping). The logarithm of the total energy in each Mel band is taken, which gives one value per filter. Finally the Discrete Cosine Transform is applied to this sequence of log energies as if it were a signal. The MFCCs are the amplitudes of the resulting spectrum.

[Fig. 2 Steps involved in MFCC feature extraction: input speech signal → framing and windowing → DFT → Mel frequency warping → log → IDFT (inverse DFT) → Mel cepstrum → MFCC feature vector.]

The steps involved in MFCC calculation are as follows.

1. Pre-emphasis: Pre-emphasis means boosting the energy in the high frequencies. To enhance the accuracy and efficiency of the feature extraction process, speech samples are normally pre-processed before features are extracted. Pre-processing includes digital filtering and speech signal detection. Filtering includes the pre-emphasis filter and the removal of noise using digital filtering algorithms. The pre-emphasis filter is normally a one-coefficient FIR filter.

2. Framing and windowing:
Framing: The continuous speech signal is blocked into frames of N samples, with the starts of adjacent frames separated by M samples (M < N), so that consecutive frames overlap by N − M samples. This continues until all the speech is accounted for within one or more frames. Typical values are N = 256 (equivalent to roughly 30 ms of windowing) and M = 100.
Windowing: Each individual frame from the framing step is windowed in order to minimize signal discontinuities and spectral distortion. The Hamming window is used. The frame size is typically 10-25 ms and the frame shift 5-10 ms. The Hamming window is given by

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \tag{1}$$

3. Discrete Fourier Transform: The DFT converts each frame of N samples from the time domain into the frequency domain. It gives the magnitude frequency response of each frame. The N-point DFT is given by

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \quad k = 0, 1, 2, \ldots, N-1 \tag{2}$$
4. Mel Frequency Warping: Human perception of the frequency content of sounds in speech signals does not follow a linear scale. For each tone with an actual frequency f in Hz, a subjective pitch is measured on the Mel scale. The Mel value for a given frequency f in Hz is computed as

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{3}$$

5. Log energy computation: As the human ear's response to speech signals is logarithmic in nature, the logarithm of each Mel value is computed.

6. Inverse Discrete Fourier Transform (IDFT): The IDFT converts each frame from the frequency domain back to the time domain. It expresses a finite sequence of data points as a sum of cosine functions oscillating at different frequencies.

7. Mel Cepstrum: The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given analysis frame. Because the Mel spectrum coefficients are real numbers, the log Mel spectrum can be converted back to the time domain in the previous step. The result is the Mel cepstrum, whose coefficients form the MFCC feature vector.

IV. FEATURE MATCHING - VECTOR QUANTIZATION

A feature matching technique based on Vector Quantization is proposed here. The acoustic vectors formed by the sets of MFCCs are clustered together. For each cluster its center is calculated; this center is called a codeword. The collection of all codewords forms a codebook. Thus, in the training phase, a codebook is generated for each input speech sample by clustering its training acoustic vectors. In the testing phase, an input utterance of unknown speech is vector quantized using each trained codebook and the vector quantization distortion is computed. The unknown speech sample is identified with the VQ codebook that gives the smallest distortion. Codebook design can be implemented using either the K-means algorithm or the LBG algorithm.
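One way to realise the codebook training and minimum-distortion matching described in this section is sketched below. It is a minimal sketch under stated assumptions rather than the authors' code. The codebook is grown by binary splitting with k-means refinement in the style of LBG. The squared Euclidean distance, the fixed number of refinement iterations and all function names are assumptions. The splitting parameter defaults to 0.01.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def nearest(vec, codebook):
    """Index of the codeword closest to vec (the VQ output symbol)."""
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def lbg(vectors, size, c=0.01, iters=20):
    """Grow a codebook by binary splitting; size should be a power of two."""
    codebook = [centroid(vectors)]          # 1-vector codebook
    while len(codebook) < size:
        # Split every codeword y into y * (1 + c) and y * (1 - c).
        codebook = [[y * (1 + s * c) for y in cw]
                    for cw in codebook for s in (1, -1)]
        for _ in range(iters):              # k-means refinement
            cells = [[] for _ in codebook]
            for v in vectors:
                cells[nearest(v, codebook)].append(v)
            # An empty cell keeps its previous codeword.
            codebook = [centroid(cell) if cell else cw
                        for cell, cw in zip(cells, codebook)]
    return codebook


def distortion(vectors, codebook):
    """Average distance from each vector to its nearest codeword."""
    return sum(sq_dist(v, codebook[nearest(v, codebook)])
               for v in vectors) / len(vectors)


def recognise(vectors, codebooks):
    """Return the label whose codebook gives the smallest distortion.

    codebooks maps a label (for example a digit name) to its codebook.
    """
    return min(codebooks, key=lambda label: distortion(vectors, codebooks[label]))
```

For example, with `books = {"one": lbg(train_one, 16), "two": lbg(train_two, 16)}`, where the hypothetical lists `train_one` and `train_two` hold MFCC vectors, `recognise(test_vectors, books)` returns the label with the least distortion. The sequence `[nearest(v, books["one"]) for v in test_vectors]` gives the codeword indices that a discrete HMM can take as observation symbols.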
The codebook design algorithm is implemented as follows:

1. Design a 1-vector codebook. This is the centroid of the entire set of training vectors.

2. Double the size of the codebook by splitting each current codeword $y_n$ according to the rule

$$y_n^{+} = y_n (1 + c), \qquad y_n^{-} = y_n (1 - c)$$

where n varies from 1 to the current size of the codebook and c is the splitting parameter (0.01 ≤ c ≤ 0.05).

3. Use the iterative algorithm (k-means or LBG), whichever is best suited, to obtain the best set of centroids for the split codebook.

4. Iterate steps 2 and 3 until a codebook of the desired size is designed.

The output of VQ is the index of the codebook vector nearest to the input vector.

V. SPEECH RECOGNITION CLASSIFIER USING HIDDEN MARKOV MODEL

An HMM is a finite state machine with a fixed number of states. It is a statistical method of characterizing the spectral properties of the frames of a speech signal. In an HMM, the states of the model cannot be observed directly; only the output symbols emitted by the states can be observed. Speech is described by parameterized statistical models, and recognition is performed by choosing the model most likely to have produced the given sequence of observations.

An HMM is characterized by three matrices, λ = (A, B, π):
A is the transition probability matrix (N × N)
B is the observation symbol probability distribution matrix (N × M)
π is the initial state distribution matrix (N × 1)
where N is the number of states in the HMM and M is the number of distinct observation symbols per state.

The complete parameter set λ = (A, B, π) can be calculated for each word model.

[Fig. 3 HMM based speech recognition: spoken digit input → vector quantized MFCCs → probability computation with the HMMs λ1 (digit 1), λ2 (digit 2), …, λv (digit v) → select maximum → recognized digit.]

VI. EXPERIMENTAL RESULTS

For the implementation of the digit recognition system, the spoken digit 'ONE' is considered.
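The maximum-likelihood selection of Fig. 3 yields one HMM likelihood value per digit model. As a minimal sketch, it can be computed with the scaled forward algorithm for a discrete HMM λ = (A, B, π). The function names and the dictionary that holds the per-digit models are assumptions made for illustration. The observation sequence is assumed to be a non-empty list of VQ codeword indices.

```python
import math


def forward_log_likelihood(obs, A, B, pi):
    """Return log P(obs | lambda) for a discrete HMM lambda = (A, B, pi).

    obs     : non-empty list of observation symbol indices
    A[i][j] : probability of moving from state i to state j (N x N)
    B[i][k] : probability of emitting symbol k in state i (N x M)
    pi[i]   : probability of starting in state i
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_like = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Forward recursion: propagate through A, then emit obs[t].
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence impossible under this model
        # Rescale to avoid underflow and accumulate the log of the scale.
        alpha = [a / scale for a in alpha]
        log_like += math.log(scale)
    return log_like


def classify(obs, models):
    """Return the label of the model with the highest log-likelihood.

    models maps a label to a tuple (A, B, pi).
    """
    return max(models, key=lambda label: forward_log_likelihood(obs, *models[label]))
```

Given a hypothetical dictionary `models = {"one": (A1, B1, pi1), "two": (A2, B2, pi2)}` of trained digit models, `classify(symbols, models)` returns the digit whose model assigns the highest likelihood to the observed symbol sequence. Training the parameters, for example by Baum-Welch re-estimation, is outside the scope of this sketch.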
For the digit 'ONE', the obtained results are as follows:

[Fig. 4 Original and filtered cropped speech signal]

[Fig. 5 Short term energy and zero crossing rate in endpoint detection]

TABLE 1 PARAMETERS FOR SPEAKER 1

Signal (Speaker 1)   MFCC Mean   MFCC Variance   HMM Likelihood Value
One                  38.0097     422.344         -182.6
Two                  37.8462     395.1422        -150.01
Three                34.2821     288.7179        -165.87
Four                 48.9759     493.6744        -173.94
Five                 41.9466     406.5802        -226.03
Six                  28.8048     250.0655        -190.68
Seven                56.242      557.4576        -182.35
Eight                36.4913     314.8482        -202.32
Nine                 38.3651     308.0966        -216.32
Ten                  25.1202     235.3668        -200.17

TABLE 2 PARAMETERS FOR SPEAKER 2

Signal (Speaker 2)   MFCC Mean   MFCC Variance   HMM Likelihood Value
One                  182.1683    1517.8          136.2
Two                  163.7754    948.2908        110.17
Three                163.2248    977.5702        34.07
Four                 186.9938    1748.8          123.42
Five                 187.3901    1739.8          158.27
Six                  150.4516    972.5804        74.59
Seven                185.1616    1728.4          70.18
Eight                175.9932    1316            51.79
Nine                 209.5781    1831.9          100.844
Ten                  184.8656    1326.8          52.01

VII. CONCLUSION

In this paper we have implemented isolated digit recognition using MFCC as the speech feature extraction method, Vector Quantization (VQ) as the speech feature matching technique and the Hidden Markov Model as the classifier. With this model it is possible to make the recognizer more robust and efficient, and hence to improve the performance of the speaker recognition system. Although the methods proposed in this paper achieved better performance, some issues remain to be investigated further.