International Journal of Engineering Trends and Technology (IJETT), Volume 31, Number 4, January 2016
ISSN: 2231-5381, http://www.ijettjournal.org

A Comparison of Feature Extraction Methods for Language Identification using GMM

Dr. A. Nagesh
Professor & HOD, Mahatma Gandhi Institute of Technology

Abstract: This paper presents the feature extraction techniques used for the language identification (LID) task and evaluates their performance. Language identification is the task of identifying the language from a short-duration speech utterance. Feature extraction is the important stage in the LID task, because LID performance depends mainly on the features used. The features commonly used in the LID task are Linear Predictive Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Predictive (PLP) features.

Keywords: Language Identification (LID), Linear Predictive Coding (LPC), Mel-frequency cepstral coefficients (MFCC), Perceptual Linear Predictive (PLP).

1. Introduction

Over the past three decades there has been tremendous development in the area of speech processing. Applications of speech processing include speech/speaker recognition, language identification, etc. A language identification system automatically identifies the specific language from a spoken utterance. The primary objective of an automatic language identification (LID) system is to determine the language identity from the uttered speech. Because automatic LID systems have several real-life applications, such as speech-to-speech translation systems, information retrieval from multilingual audio databases and multilingual speech recognition systems, LID has become an active research problem in India and abroad. Here our main focus is on features based on LID cues for a language identification system.

Automatic language identification is an application of pattern recognition, and an automatic language identification system is like any other pattern recognition system. The language identification task involves three phases, namely the feature extraction phase, the training phase and the testing phase. Feature extraction is the process of obtaining features from a speech signal. Training is the process of familiarizing the system with the characteristics of a language, whereas testing is the actual identification task.

2. Feature Extraction Method

The main purpose of the feature extraction process is to extract the most relevant information from the speech waveform and to discard as much of the redundant information as possible. In language recognition, feature extraction is given a lot of importance because recognition performance depends heavily on this step. The main goal of the feature extraction step is to compute a set of feature vectors providing a compact representation of the given input signal. Through decades of research, many different feature representations of the speech signal have been suggested and tried. This paper reviews the LPC, MFCC and PLP feature extraction methods and uses GMM modeling to evaluate their language identification performance.

2.1 Linear Predictive Coding

Linear Predictive Coding (LPC) is one of the most powerful speech analysis techniques. The glottis (the space between the vocal cords) produces the sound, which is characterized by its intensity (loudness) and frequency (pitch). The vocal tract (the throat, the mouth and the nasal cavity) forms the tube, which is characterized by its resonance frequencies, called formants. The basic problem of the LPC system is to determine the formants from the speech signal. The solution of this problem is a difference equation, which expresses each sample of the signal as a linear combination of previous samples. Such an equation is called a linear predictor, hence the name Linear Predictive Coding. The coefficients of the difference equation (the predictive coefficients) characterize the formants. Therefore, the LPC system needs to estimate these coefficients. The estimation is made by minimizing the mean square error between the predicted signal and the actual signal.

Linear prediction models the output s(n) as a linear function of past outputs and of present and past inputs. Assuming an all-pole model for the vocal tract, the signal s(n) can be expressed as a linear combination of past values and some input u(n), as shown below:

$$s(n) = \sum_{k=1}^{p} a_k \, s(n-k) + G \, u(n)$$

where G is a gain factor. Now assuming that the input u(n) is unknown, the signal s(n) can be predicted only approximately from a linear weighted sum of past samples. Let this approximation of s(n) be $\hat{s}(n)$, where

$$\hat{s}(n) = \sum_{k=1}^{p} a_k \, s(n-k)$$

Then the error between the actual value s(n) and the predicted value $\hat{s}(n)$ is given by

$$e(n) = s(n) - \hat{s}(n) = G \, u(n)$$

The error transfer function is

$$A(z) = \frac{E(z)}{S(z)} = 1 - \sum_{k=1}^{p} a_k \, z^{-k}$$

The basic problem of linear prediction analysis is to determine the set of predictor coefficients from the speech signal so that the spectral properties of the digital filter match those of the speech waveform within the analysis window. A cepstrally weighted feature vector is obtained for each frame by block processing of the continuous speech signal. To spectrally flatten the signal, the speech signal is subjected to pre-emphasis through a first-order digital filter whose transfer function is of the form $H(z) = 1 - \alpha z^{-1}$, with the coefficient $\alpha$ slightly below 1.

The linear prediction method provides a robust, reliable and accurate method for estimating the parameters. The computation involved in LPC processing is considerably less than in cepstrum analysis.
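The estimation of the predictor coefficients described in Section 2.1 can be illustrated with a minimal sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common way to minimize the mean square prediction error; the text does not say which solver the author used. The function names, the pre-emphasis coefficient 0.95 and the synthetic test signal are illustrative assumptions, not values from the paper.

```python
def pre_emphasis(signal, alpha=0.95):
    """First-order pre-emphasis y(n) = x(n) - alpha * x(n-1); alpha is an assumed value."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def autocorrelation(frame, max_lag):
    """r[k] = sum_n frame[n] * frame[n + k] for k = 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for a_1..a_p in s_hat(n) = sum_k a_k s(n-k) by minimising the squared error.

    Returns ([a_1, ..., a_p], residual_energy).
    """
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for model order i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err


def synth_all_pole(coeffs, n_samples):
    """Impulse response of s(n) = sum_k a_k s(n-k) + u(n), u a unit impulse (G = 1)."""
    s = []
    for n in range(n_samples):
        value = 1.0 if n == 0 else 0.0
        value += sum(a * s[n - k]
                     for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        s.append(value)
    return s


# Hypothetical check: recover the coefficients of a known second-order all-pole model.
signal = synth_all_pole([1.3, -0.4], 200)
coeffs, residual = levinson_durbin(autocorrelation(signal, 2), order=2)
print([round(c, 4) for c in coeffs], round(residual, 4))
```

In this toy case the residual energy plays the role of the squared gain of the excitation, which mirrors the relation e(n) = G u(n). A real front end would apply `pre_emphasis` and a window to each speech frame before the autocorrelation step.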
Fig. 1: Block diagram illustrating the steps involved in the computation of LPC.

2.2 Mel Frequency Cepstral Coefficients

One of the important LID cues is the phonemic difference between languages. The capability of MFCCs to capture the phonetically important characteristics of speech has motivated the use of these feature vectors for the LID task. Mel-frequency cepstral coefficients (MFCC) are the most widely used feature vectors in language identification systems. Based on human auditory perception, Stevens and Volkmann developed the Mel scale. The Mel scale was used by Mermelstein and Davis to extract features from the speech signal for improved recognition performance. This section describes how the input speech signal is transformed into a sequence of MFCC feature vectors. The steps followed in the computation of the MFCC are as follows:

1. The speech signal is sampled at 16 kHz.
2. A 25 ms segment of the speech signal is taken as one frame (frame size), with a frame shift of 10 ms.
3. The speech signal is analyzed over a short-time analysis window. From each short-time analysis window, the spectrum is obtained by using the Discrete Fourier Transform (DFT).
4. Passing this spectrum through the mel filters yields the mel spectrum.
5. The log of the mel spectrum is taken.
6. Applying the discrete cosine transform to the log mel spectrum yields the MFCC.

The entire signal is now represented in the form of feature vectors. These feature vectors are given as input to the LID system. The block diagram for the computation of the MFCC is shown in Fig. 2.
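The MFCC steps listed above can be sketched in a few lines of standard-library Python. Only the 16 kHz sampling rate, the 25 ms frame and the 10 ms shift come from the text. The Hamming window, the 512-point DFT, the 26 triangular filters, the 13 output coefficients and the particular Hz-to-mel formula are assumptions chosen for illustration, and the naive DFT is written for readability rather than speed.

```python
import cmath
import math


def hz_to_mel(f_hz):
    """One widely used mel-scale formula (an assumption; the text gives none)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def frame_signal(signal, frame_len, frame_shift):
    """Step 2: split the signal into overlapping frames."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, frame_shift)]


def power_spectrum(frame, n_fft):
    """Step 3: Hamming window, then a zero-padded DFT for bins 0..n_fft/2."""
    n = len(frame)
    win = [frame[i] * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
           for i in range(n)]
    spec = []
    for k in range(n_fft // 2 + 1):
        acc = sum(win[i] * cmath.exp(-2j * math.pi * k * i / n_fft)
                  for i in range(n))
        spec.append(abs(acc) ** 2 / n_fft)
    return spec


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Step 4: triangular filters spaced uniformly on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    mel_points = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):
            filt[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):
            filt[k] = (hi - k) / (hi - mid)
        bank.append(filt)
    return bank


def dct2(values, n_coeffs):
    """Step 6: unnormalised DCT-II of the log mel spectrum."""
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (i + 0.5) / n)
                for i, v in enumerate(values))
            for c in range(n_coeffs)]


def mfcc(signal, sample_rate=16000, frame_ms=25, shift_ms=10,
         n_fft=512, n_filters=26, n_coeffs=13):
    frame_len = sample_rate * frame_ms // 1000
    frame_shift = sample_rate * shift_ms // 1000
    bank = mel_filterbank(n_filters, n_fft, sample_rate)
    feats = []
    for frame in frame_signal(signal, frame_len, frame_shift):
        spec = power_spectrum(frame, n_fft)
        energies = [sum(w * p for w, p in zip(filt, spec)) for filt in bank]
        log_mel = [math.log(e + 1e-10) for e in energies]   # step 5
        feats.append(dct2(log_mel, n_coeffs))
    return feats


# Hypothetical input: 50 ms of a 1 kHz tone sampled at 16 kHz.
tone = [math.sin(2.0 * math.pi * 1000.0 * n / 16000.0) for n in range(800)]
features = mfcc(tone)
print(len(features), len(features[0]))
```

Each inner list is one feature vector, so an utterance becomes a sequence of such vectors that can be passed to the LID model.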
[Fig. 2 blocks: Pre-Emphasis & Framing, DFT, Power Spectrum, Mel Scale Filter Banks, Compute Filter Bank Energies, Logarithm, Discrete Cosine Transform, Mel Frequency Cepstral Coefficients]

Fig. 2: Block diagram illustrating the steps involved in the computation of the Mel Frequency Cepstral Coefficients (MFCC).

2.3 Perceptual Linear Predictive

Perceptual Linear Predictive (PLP) features are among the most frequently used features in speech processing. The main objective of the original PLP is to describe the psychophysics of human hearing more accurately in the feature extraction process. The PLP speech analysis technique is based on the short-term spectrum of speech. It uses knowledge about human perception to emphasize the important speech information in the spectrum while minimizing the differences between speakers. PLP consists of the following steps:

1. The power spectrum is computed from the windowed speech signal.
2. A frequency warping into the Bark scale is applied.
3. The auditorily warped spectrum is convolved with the power spectrum of the simulated critical-band masking curve, which models the critical-band integration of human hearing.
4. The smoothed spectrum is down-sampled at intervals of ≈1 Bark. The three steps of frequency warping, smoothing and sampling (steps 2 to 4) are integrated into a single filter bank called the Bark filter bank.
5. An equal-loudness pre-emphasis weights the filter-bank outputs to simulate the sensitivity of hearing.
6. The equalized values are transformed according to Stevens' power law by raising each to the power of 0.33.
7. The resulting auditorily warped line spectrum is further processed by linear prediction (LP).

Fig. 3: Block diagram illustrating the steps involved in the computation of the PLP.
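Steps 2 to 7 of the PLP procedure can be sketched as follows, starting from a power spectrum that is assumed to be already computed (step 1). The Bark warping formula, the shape of the critical-band curve and the equal-loudness approximation are the ones commonly quoted for Hermansky's PLP formulation. They are not stated in the text above, so they should be read as assumptions, as should the function names and the LP order of 8.

```python
import math


def hz_to_bark(f_hz):
    """Bark warping (assumed form): 6 * asinh(f / 600)."""
    return 6.0 * math.asinh(f_hz / 600.0)


def bark_to_hz(bark):
    return 600.0 * math.sinh(bark / 6.0)


def critical_band_weight(delta):
    """Simulated critical-band masking curve (assumed shape), delta in Bark."""
    if delta < -1.3 or delta > 2.5:
        return 0.0
    if delta <= -0.5:
        return 10.0 ** (2.5 * (delta + 0.5))
    if delta < 0.5:
        return 1.0
    return 10.0 ** (-(delta - 0.5))


def equal_loudness(f_hz):
    """Equal-loudness weighting (assumed approximation), omega in rad/s."""
    w2 = (2.0 * math.pi * f_hz) ** 2
    return (w2 + 56.8e6) * w2 * w2 / ((w2 + 6.3e6) ** 2 * (w2 + 0.38e9))


def auditory_spectrum(power_spec, sample_rate):
    """Steps 2-6: Bark filter bank, equal loudness, power-law compression."""
    n_bins = len(power_spec)                 # bins cover 0 .. sample_rate / 2
    nyquist = sample_rate / 2.0
    bin_bark = [hz_to_bark(k * nyquist / (n_bins - 1)) for k in range(n_bins)]
    n_bands = int(round(bin_bark[-1])) + 1   # band centres roughly 1 Bark apart
    step = bin_bark[-1] / (n_bands - 1)
    bands = []
    for i in range(n_bands):
        centre = i * step
        # Convolution of the warped spectrum with the masking curve.
        energy = sum(p * critical_band_weight(centre - b)
                     for p, b in zip(power_spec, bin_bark))
        bands.append((equal_loudness(bark_to_hz(centre)) * energy) ** 0.33)
    return bands


def spectrum_to_autocorr(spec, max_lag):
    """Inverse DFT of the even-symmetric spectrum, giving r[0..max_lag]."""
    n = len(spec)
    period = 2 * (n - 1)
    r = []
    for k in range(max_lag + 1):
        acc = spec[0] + spec[-1] * math.cos(math.pi * k)
        acc += 2.0 * sum(spec[j] * math.cos(2.0 * math.pi * j * k / period)
                         for j in range(1, n - 1))
        r.append(acc / period)
    return r


def levinson_durbin(r, order):
    """Step 7: all-pole fit; returns ([a_1..a_p], residual_energy)."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err


def plp_coefficients(power_spec, sample_rate, order=8):
    bands = auditory_spectrum(power_spec, sample_rate)
    return levinson_durbin(spectrum_to_autocorr(bands, order), order)


# Hypothetical input: a flat 257-bin power spectrum at 16 kHz.
flat = [1.0] * 257
bands = auditory_spectrum(flat, 16000)
coeffs, err = plp_coefficients(flat, 16000, order=8)
print(len(bands), len(coeffs))
```

The sketch stops at the LP coefficients, as the step list does. Systems that use PLP features often convert these coefficients to cepstral coefficients afterwards, which is outside what the text describes.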
3. Gaussian Mixture Model Training

In a GMM, the probability distribution of the observed data takes the form given by the following equation:

$$p(x \mid \lambda) = \sum_{i=1}^{M} p_i \, b_i(x)$$

where M is the number of component densities, x is a D-dimensional observed data vector, $b_i(x)$ is the component density and $p_i$ is the mixture weight, for i = 1, ..., M. Each component density $b_i(x)$ denotes a D-dimensional normal distribution with mean vector $\mu_i$ and covariance matrix $\Sigma_i$:

$$b_i(x) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)' \, \Sigma_i^{-1} \, (x - \mu_i) \right\}$$

The mixture weights are positive scalar values that satisfy the condition

$$\sum_{i=1}^{M} p_i = 1$$

These parameters can be collectively represented as $\lambda = \{p_i, \mu_i, \Sigma_i\}$, for i = 1, ..., M. Each language in a language identification system can be represented by one distinct GMM, referred to as the language model $\lambda_i$, for i = 1, 2, 3, ..., N, where N is the number of languages.

3.1 Training the Model

Clusters are formed within the training data. Each cluster is then represented by a Gaussian probability density function (pdf). The union of many such Gaussian pdfs is a GMM. The most common approach to estimating the GMM parameters is maximum likelihood estimation, where $P(X \mid \lambda)$ is maximized with respect to $\lambda$. Here $P(X \mid \lambda)$ is the conditional probability and $X = \{x_1, x_2, \ldots, x_t\}$ is the set of all feature vectors belonging to a particular acoustic class. Since there is no closed-form solution to the maximum likelihood estimation, convergence is guaranteed only when enough data is available. An iterative approach is followed for computing the GMM parameters using the Expectation-Maximization (EM) algorithm.

The aim of training is to obtain the mean, variance and weighting of each Gaussian distribution ($\mu_i$, $\Sigma_i$, $p_i$). The feature vectors are first clustered using the K-means clustering algorithm. The number of clusters chosen is 64, which approximates the union of the phonemes in all the languages under consideration. The GMM is then re-estimated using the EM algorithm. For each language $L_i$, one GMM is created. This procedure is repeated for all the languages under consideration, so that a separate GMM is created for each.

Table: LID performance for the three-language task using GMM. Entries are identification performance (%).

                                      No. of Gaussians
Language  Time (sec.)  Method     4     6     8    12    16    32
English   1            LPC       45    49    52    55    59    60
          1            MFCC      50    54    57    58    60    65
          1            PLP       43    46    50    54    55    57
          2            LPC       49    51    54    57    61    64
          2            MFCC      55    56    59    62    66    69
          2            PLP       48    52    55    60    61    64
          3            LPC       55    57    60    64    66    69
          3            MFCC      60    63    67    70    72    74
          3            PLP       52    55    60    61    63    66
French    1            LPC       40    42    45    49    52    57
          1            MFCC      50    54    57    58    60    65
          1            PLP       40    43    45    50    51    53
          2            LPC       44    46    49    50    55    59
          2            MFCC      50    52    55    56    60    62
          2            PLP       46    50    51    54    57    60
          3            LPC       51    52    54    57    58    61
          3            MFCC      57    60    64    66    68    70
          3            PLP       59    56    57    59    60    64
German    1            LPC       48    51    53    57    58    60
          1            MFCC      44    46    47    50    53    57
          1            PLP       42    45    48    50    53    56
          2            LPC       45    48    51    53    55    58
          2            MFCC      52    54    56    58    62    63
          2            PLP       49    50    53    53    58    61
          3            LPC       52    53    55    58    60    62
          3            MFCC      59    62    66    68    70    72
          3            PLP       60    63    64    65    66    68

Conclusion

LPC, MFCC and PLP are the most commonly proposed acoustic features for language identification. The performance of the LID system depends mainly on the type of feature vectors used for the task. In this paper we compared the language identification performance of different feature extraction methods using GMM. The LID performance can be improved by combining more features of the speech signal.
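As a supplementary illustration of the GMM training and identification procedure of Section 3, the following minimal sketch trains one GMM per language with EM and picks the language whose model gives the highest log-likelihood. It makes several simplifying assumptions that differ from the setup described in the paper: diagonal covariance matrices, two components instead of 64, a simple spread-out initialization in place of K-means, and synthetic two-dimensional data standing in for real feature vectors. All names are hypothetical.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """log b_i(x) for a D-dimensional Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def component_logs(x, model):
    weights, means, variances = model
    return [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]


def log_likelihood(data, model):
    """log P(X | lambda) = sum_t log sum_i p_i b_i(x_t)."""
    return sum(log_sum_exp(component_logs(x, model)) for x in data)


def em_step(data, model, var_floor=1e-3):
    n_comp, dim = len(model[0]), len(data[0])
    # E-step: posterior probability of each component for each vector.
    resp = []
    for x in data:
        logs = component_logs(x, model)
        norm = log_sum_exp(logs)
        resp.append([math.exp(v - norm) for v in logs])
    # M-step: re-estimate weights, means and (floored) variances.
    weights, means, variances = [], [], []
    for i in range(n_comp):
        n_i = sum(r[i] for r in resp) + 1e-10
        mean = [sum(r[i] * x[d] for r, x in zip(resp, data)) / n_i
                for d in range(dim)]
        var = [max(sum(r[i] * (x[d] - mean[d]) ** 2
                       for r, x in zip(resp, data)) / n_i, var_floor)
               for d in range(dim)]
        weights.append(n_i)
        means.append(mean)
        variances.append(var)
    total = sum(weights)
    return [w / total for w in weights], means, variances


def init_model(data, n_comp):
    """Assumed initialisation: evenly spaced samples as means, global variance."""
    dim = len(data[0])
    g_mean = [sum(x[d] for x in data) / len(data) for d in range(dim)]
    g_var = [sum((x[d] - g_mean[d]) ** 2 for x in data) / len(data)
             for d in range(dim)]
    means = [list(data[(i * len(data)) // n_comp]) for i in range(n_comp)]
    return [1.0 / n_comp] * n_comp, means, [g_var[:] for _ in range(n_comp)]


def train_gmm(data, n_comp, n_iter=15):
    model = init_model(data, n_comp)
    for _ in range(n_iter):
        model = em_step(data, model)
    return model


def identify(models, utterance):
    """Return the language whose GMM scores the utterance highest."""
    return max(models, key=lambda lang: log_likelihood(utterance, models[lang]))


rng = random.Random(0)


def cloud(cx, cy, n):
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for _ in range(n)]


# Synthetic stand-ins for the feature vectors of two hypothetical languages.
train = {"A": cloud(0, 0, 100) + cloud(3, 3, 100),
         "B": cloud(-4, 4, 100) + cloud(4, -4, 100)}
models = {lang: train_gmm(vectors, n_comp=2) for lang, vectors in train.items()}
decision = identify(models, cloud(0, 0, 15) + cloud(3, 3, 15))
print(decision)
```

The log-domain computation avoids numerical underflow when many feature vectors are scored, and the variance floor keeps a component from collapsing onto a single point.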