International Conference on Global Trends in Engineering, Technology and Management (ICGTETM-2016)

Generation of Real Time Database for Speech Recognition System

Vivek Khalane#1, Umesh Bhadade#2
#1 Assistant Professor, Department of Instrumentation Engineering, R.A.I.T, Nerul, India
#2 Professor, Department of Information Technology, S.S.B.T COE, Jalgaon, India

Abstract— Real-time speech recognition environments require high-quality speech segmentation. Current speaker identification systems give satisfactory results when training is done offline on a good-quality database, and most existing work concentrates on this offline training structure. In some cases, however, offline training is not possible because no prior information about the speakers is available. Generating the feature database dynamically therefore requires a run-time training approach. In this paper we present a speaker recognition system that uses polynomial and GMM (Gaussian Mixture Model) classifiers with MFCC (Mel Frequency Cepstral Coefficient) features. The proposed system gives 97% accuracy on a five-speaker database.

Keywords— Speech recognition, Dynamic database creation, Speaker recognition.

I. INTRODUCTION
A complete conversation can be followed by any human listener, but for a machine, separating out an individual speaker's speech is a difficult task without prior training. Real-time speaker recognition therefore requires the training and creation of a speaker database at run time. A typical application is a smart media player, which must identify the unique speaker present in the content during playback. For real-time speaker recognition, the first stage is to segment the speech by speaker, use the segments for training, and then apply automatic speaker identification. Several algorithms have been proposed for speaker segmentation [6], [7], [8], [9], [10]. Mel Frequency Cepstral Coefficients (MFCCs) are a widely accepted feature [1], and we apply Gaussian mixture models (GMMs) for classification [2].
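As a concrete illustration of the MFCC feature extraction used throughout the paper, the following NumPy sketch computes MFCCs for a single analysis window. It is a minimal sketch, not the paper's implementation: the 20-filter Mel bank and 13 output coefficients are illustrative choices.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel scale used for MFCC filter banks
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters H(k, m), spaced evenly on the mel scale
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            H[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            H[m - 1, k] = (right - k) / max(right - centre, 1)
    return H

def mfcc_frame(x, fs, n_filters=20, n_ceps=13):
    # One window: Hamming window -> |DFT| -> mel filter bank -> log -> DCT
    N = len(x)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))  # Hamming window
    X = np.abs(np.fft.rfft(x * w))                 # magnitude spectrum |X(k)|
    H = mel_filterbank(n_filters, N, fs)
    Xp = np.log(H @ X + 1e-10)                     # log mel energies X'(m)
    m = np.arange(1, n_filters + 1)
    # DCT of the log filter-bank outputs gives the cepstral coefficients
    return np.array([np.sum(Xp * np.cos(np.pi * l * (m - 0.5) / n_filters))
                     for l in range(1, n_ceps + 1)])
```

A 20 ms frame at 16 kHz (as used later in the paper) corresponds to 320 samples passed to `mfcc_frame`.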
The rest of the paper is organized as follows. Section II explains the basic preliminaries; Section III discusses the proposed algorithm for run-time database generation; Sections IV and V describe the experimental set-up and the results, respectively; and Section VI concludes the paper.

II. PRELIMINARIES
Speaker modelling amounts to retrieving a model for each speaker. Since the retrieved models define different classes, the task can be treated like any other classification problem, in which a classifier assigns each observation to a class. The features and classifiers used are explained below.

A. MFCC features
MFCC features adapt to the specificities of speech through filter-bank processing, and can be viewed as a deconvolution-based modification of the conventional cepstrum [1]. The speech signal is split into short time windows, and the Discrete Fourier Transform (DFT) of each window is computed. For a discrete-time signal x(n) of length N,

  X(k) = Σ_{n=0}^{N−1} w(n) x(n) e^{−j2πkn/N},  k = 0, 1, …, N−1,   (1)

where k corresponds to the frequency f(k) = k·f_s/N, f_s is the sampling frequency in Hertz, and w(n) is a time window. Here we chose the popular Hamming window,

  w(n) = 0.54 − 0.46 cos(2πn/(N−1)),  n = 0, 1, …, N−1.   (2)

The magnitude spectrum |X(k)| is now scaled in both frequency and magnitude. First, the frequency axis is warped logarithmically using the so-called Mel filter bank H(k, m), and then the logarithm is taken, giving

  X′(m) = ln( Σ_{k=0}^{N−1} |X(k)| H(k, m) ),  m = 1, 2, …, M.   (3)

Finally, the MFCCs are obtained by computing the DCT:

  c(l) = Σ_{m=1}^{M} X′(m) cos( πl(m − 1/2)/M ),  l = 1, 2, …, M,   (4)

where c(l) is the l-th MFCC and M is the number of Mel filters used.

B. Classifiers
A classifier is an algorithm that implements a concrete classification rule. In this paper we consider two classifiers.

1) Polynomial classifier: The polynomial classifier expands each feature vector into a vector of monomials. For a feature vector x = [x1, x2], for example, the second-order basis is

  p(x) = [1, x1, x2, x1², x1·x2, x2²]ᵀ.   (5)

In general, the polynomial basis is formed from all monomials

  x_{i1} x_{i2} ⋯ x_{ik},   (6)

where k is less than or equal to the order K of the polynomial. For each input feature vector, the matching score is calculated as the inner product with the classifier weight vector w, and the scores are averaged over the N input vectors to obtain the final output:

  s = (1/N) Σ_{i=1}^{N} wᵀ p(x_i).   (7)

2) GMM classifier: GMM parameters are derived with the Expectation-Maximization (EM) algorithm, or by Maximum A Posteriori (MAP) estimation from a well-trained prior model [3]. A Gaussian mixture model is a weighted sum of M component Gaussian densities,

  p(x | λ) = Σ_{i=1}^{M} w_i g(x | μ_i, Σ_i),   (8)

where x is a D-dimensional data vector (i.e., a measurement or feature vector), w_i, i = 1, …, M, are the mixture weights, and g(x | μ_i, Σ_i), i = 1, …, M, are the component Gaussian densities. Each component density is a D-variate Gaussian of the form

  g(x | μ_i, Σ_i) = (1 / ((2π)^{D/2} |Σ_i|^{1/2})) exp( −(1/2) (x − μ_i)ᵀ Σ_i^{−1} (x − μ_i) ),   (9)

with mean vector μ_i and covariance matrix Σ_i. The mixture weights satisfy the constraint Σ_{i=1}^{M} w_i = 1. The complete Gaussian mixture model is parameterized by the mean vectors, covariance matrices, and mixture weights of all component densities, collectively represented by the notation

  λ = { w_i, μ_i, Σ_i },  i = 1, 2, …, M.   (10)

The covariance matrices Σ_i can be of full rank or constrained to be diagonal.

III. PROPOSED ALGORITHM
In the proposed algorithm, speaker clusters are created by GMM-based speaker segmentation and used for dynamic database creation. MFCC features and a polynomial-based classifier are then used for recognition.

A. Preprocessing
The input speech contains high-frequency noise, which is removed with an FIR low-pass filter. The average of the feature vectors over a given speech segment is used for training and testing of speakers. Silence intervals may decrease the accuracy of speech recognition; they can be removed using a voice activity detection algorithm. The following subsections cover the two parts of the algorithm: dynamic database creation and speaker recognition.

B. Dynamic database creation
For dynamic database creation, clusters of speech are obtained by segmenting the speech online. The speaker database is then trained on the individual speakers for recognition. Segmentation is carried out to detect speaker change points in the speech; a contiguous portion of the conversation coming from an individual speaker is termed a segment [6], [7], [8], [9], [10]. Fig. 1 gives a visual illustration. Segmentation is performed by observing the MFCC vectors within a sliding window of prescribed length, with the candidate speaker change point taken at the centre of the window. First, the MFCC vectors of the first and second halves of the window are used to compute GMM models N1 and N2. A distance based on the BIC (Bayesian Information Criterion) is then measured between N1 and N2 [11].
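The sliding-window comparison described above can be sketched as follows. This is a simplified sketch, not the paper's exact method: each half-window model is reduced to a single full-covariance Gaussian (a common approximation of the GMM-based BIC criterion), and the window size, penalty weight `lam`, and threshold parameters `W` and `w` are illustrative assumptions.

```python
import numpy as np

def delta_bic(F1, F2, lam=1.0):
    # BIC gain of modelling the two MFCC segments separately rather than
    # jointly, with one full-covariance Gaussian per segment.
    # A large positive value suggests a speaker change between F1 and F2.
    def n_logdet(F):
        cov = np.cov(F, rowvar=False) + 1e-6 * np.eye(F.shape[1])
        _, logdet = np.linalg.slogdet(cov)
        return len(F) * logdet
    n1, n2 = len(F1), len(F2)
    d = F1.shape[1]
    penalty = lam * 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n1 + n2)
    return 0.5 * (n_logdet(np.vstack([F1, F2]))
                  - n_logdet(F1) - n_logdet(F2)) - penalty

def distance_curve(feats, win=100):
    # Slide the window over the MFCC sequence; one distance value per
    # candidate change point at the window centre.
    return np.array([delta_bic(feats[i - win:i], feats[i:i + win])
                     for i in range(win, len(feats) - win)])

def adaptive_threshold(D, w=10, W=1.0):
    # Local-neighbourhood threshold: W times the mean of the 2w+1
    # surrounding distance values, i.e. (W / (2w+1)) * sum of D(j).
    Th = np.full(len(D), np.inf)
    for i in range(w, len(D) - w):
        Th[i] = W * np.mean(D[i - w:i + w + 1])
    return Th
```

A distance value that exceeds its local threshold is declared a speaker change point; the peak of the curve marks the most likely change position.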
The distance measure D is given by

  D = ΔBIC(N1, N2) + ΔBIC(N2, N1).   (11)

The obtained value is saved and the window is shifted for the next iteration; the array of distance values is computed for the remaining vectors. An adaptive threshold is used to avoid false change detections:

  Th(i) = (W / (2w + 1)) Σ_{j=i−w}^{i+w} D(j),   (12)

where W is a weight, w is the number of difference values taken into account in the left and right local neighbourhoods, and i is the frame under consideration.

Fig. 1 Illustration of speaker change point detection.

The resulting speech segments form clusters belonging to individual speakers, and these clusters are used to train the algorithm, as shown in Fig. 2.

Fig. 2 Dynamic database creation using speech segmentation.

C. Polynomial classifier for speaker recognition
For speaker recognition, a polynomial-based classifier is used because of its computational simplicity [5]. For testing, the online speech is divided and features are calculated for every sample, which are then matched against the available database. The input speech is divided into 20 ms frames with 50% overlap; the overlap avoids the loss of information caused by sharp truncation of the data at the frame edges. Each frame is passed through a low-pass filter to remove noise, and the MFCC features are extracted as described in Section II. The system also calculates Inverted Mel Frequency Cepstral Coefficients (IMFCC) [4], [5]. During testing, the extracted features are applied to each individual polynomial speaker model in both the MFCC and IMFCC domains, yielding two scores [12]. The final matching score is calculated as

  Score = α · Score_MFCC + (1 − α) · Score_IMFCC.   (13)

The value of α was chosen empirically as 0.56.

IV. EXPERIMENTAL SETUP
The recordings were made using an MI Prime IV at a sampling frequency of 16 kHz. The data consist of 15 recordings of 5 speakers; in each recording the speakers speak one after another with pauses of 3 s, and the total duration of a recording is approximately 20 s. The test data consist of 20 to 30 utterances from the same 5 speakers.

V. RESULTS AND DISCUSSIONS
The results obtained by the proposed speech segmentation algorithm are quite good: the system generates the dynamic database without any prior knowledge of the number of speakers or their speaking characteristics. The recognition accuracy, however, cannot be assumed to increase with the number of test samples. Fig. 3 shows the result of the speaker segmentation.

Fig. 3 Difference measure graph with local extremes detected.

Table I gives the recognition results, assuming the segmentation is accurate.

TABLE I
EXPERIMENTATION RESULTS FOR RECOGNITION

  Number of Speakers | Number of Test Samples | Sampling Frequency | Recognition Accuracy (Prototype)
  5                  | 20                     | 16 kHz             | 96.5%
  5                  | 30                     | 16 kHz             | 91.5%

VI. CONCLUSIONS
We presented a speech recognition system with real-time database creation, introducing a dynamic speaker recognition system. For the database creation part, we proposed a speech segmentation technique, and the proposed system was analyzed on a real-time dataset.
Our experiments indicate that the proposed method gives good results while offering run-time training as an additional feature over offline training systems. In future work, the recognition results can be studied under environmental or background noise, and the accuracy of the proposed algorithm can be evaluated in the case of multiple competing speakers speaking at the same time.

REFERENCES
[1] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. on ASSP, 1980.
[2] D. A. Reynolds, "Automatic speaker recognition using Gaussian mixture speaker models," Lincoln Lab. J., vol. 8, no. 2, pp. 173–192, 1995.
[3] A. S. Malegaonkar, A. M. Ariyaeeinia, and P. Sivakumaran, "Efficient speaker change detection using adapted Gaussian mixture models," IEEE Trans. on Audio, Speech and Language Processing, vol. 15, no. 6, 2007.
[4] S. Chakroborty and G. Saha, "Improved text-independent speaker identification using fused MFCC & IMFCC feature sets based on Gaussian filter," International Journal of Information and Communication Engineering, vol. 5, no. 1, 2009.
[5] W. M. Campbell, K. T. Assaleh, and C. C. Broun, "Speaker recognition with polynomial classifiers," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 4, pp. 205–212, 2002.
[6] H. Gish, M.-H. Siu, and R. Rohlicek, "Segregation of speakers for speech recognition and speaker identification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'91), 1991, vol. 2, pp. 873–876.
[7] P. Delacourt and C. J. Wellekens, "DISTBIC: A speaker-based segmentation for audio data indexing," Speech Commun., vol. 32, no. 1–2, pp. 111–126, 2000.
[8] J. Ajmera et al., "Robust speaker change detection," IEEE Signal Process. Lett., vol. 11, no. 8, pp. 649–651, Aug. 2003.
[9] M. Roch and Y. Cheng, "Speaker segmentation using MAP-adapted Bayesian Information Criterion," in Proc. Speaker Lang. Recognition Workshop (Odyssey), 2004, pp. 349–354.
[10] V. Karthik, D. S. Satish, and C. C. Sekhar, "Speaker change detection using support vector machine," in Proc. 3rd Int. Conf. Non-Linear Speech Process., Barcelona, Spain, Apr. 19–22, 2005, pp. 130–136.
[11] B. D. Patil, Y. Manav, and P. Sudheendra, "Dynamic database creation for speaker recognition system," in Proc. Int. Conf. on Advances in Mobile Computing & Multimedia, ACM, 2013.
[12] Y. S. Angal, P. K. Ajmera, and R. S. Holambe, "Speech recognition of isolated words in noisy conditions using Radon transform and discrete cosine transform based features derived from speech spectrogram," Digital Signal Processing, vol. 4, no. 5, pp. 178–183, 2012.

ISSN: 2231-5381   http://www.ijettjournal.org