International Journal of Engineering Trends and Technology (IJETT) – Volume 9 Number 3 – Mar 2014

Speaker Identification with Back Propagation Neural Network Using Tunneling Algorithm

PPS Subhashini #1, Dr. M. Satya Sairam #2, Dr. D. Srinivasarao #3
#1 Associate Prof., ECE Dept., RVR & JC College of Engg., Guntur
#2 Prof. & Head, ECE Dept., Chalapathi Institute of Engg. and Technology, Guntur
#3 Prof. & Head, Department of ECE, JNTUH, Hyderabad

Abstract - Speaker identification has long been an active area of research because of its diverse applications, and it continues to be a challenging topic. Back propagation (BP) neural networks offer attractive possibilities for solving signal processing and pattern classification problems. Several algorithms have been proposed for choosing the BP neural network prototypes and training the network; the selection of the prototypes and the network weights is a system identification problem. This paper implements an enhanced training method for the BP neural network based on the tunneling algorithm and applies it to the speaker identification problem. Features are obtained using linear predictive coding (LPC) coefficients, and these features are classified with a back propagation neural network. The efficiency of the proposed method is tested on different speaker voices. It is shown that the use of the tunneling algorithm results in faster learning and helps the network reach the global minimum.

Keywords - ANN, Back Propagation Training, LPC method, MLP network.

I. INTRODUCTION
Speaker recognition, also called voice recognition, is the identification of the person who is speaking from characteristics of his or her voice (voice biometrics). Recognizing the speaker can simplify the task of translating speech in systems that have been trained on specific persons' voices, or it can be used to authenticate or verify the identity of a speaker as part of a security process. Speaker recognition has a history dating back some four decades and uses the acoustic features of speech that have been found to differ between individuals. These acoustic patterns reflect both anatomy (i.e., the size and shape of the throat and mouth) and learned behavioral patterns (i.e., voice pitch and speaking style). Speaker verification has earned speaker recognition its classification as a "behavioral biometric" [1][2].
Each speaker recognition system has two phases: enrollment and verification. During enrollment, the speaker's voice is recorded and typically a number of features are extracted to form a voice print, template, or model. In the verification phase, a speech sample or "utterance" is compared against a previously created voice print. Identification systems compare the utterance against multiple voice prints in order to determine the best match(es), while verification systems compare an utterance against a single voice print. Because of the process involved, verification is faster than identification.
Speaker recognition systems fall into two categories: text-dependent and text-independent. If the text must be the same for enrollment and verification, this is called text-dependent recognition; in a text-dependent system, prompts can be common across all speakers. Text-independent systems are most often used for speaker identification, as they require very little, if any, cooperation by the speaker; in this case the text used during enrollment and testing is different.
In fact, the enrollment may happen without the user's knowledge [3]. This paper focuses on speaker recognition using a Multilayer Perceptron (MLP) neural network trained with the back propagation algorithm. The results show that the proposed method has a very high success rate in recognizing different speaker identities.

II. ARTIFICIAL NEURAL NETWORKS
A neural network, or Artificial Neural Network (ANN), is a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use. A neural network contains a large number of simple neuron-like processing elements, and a large number of weighted connections encode the knowledge of the network. Though biologically inspired, many of the neural network models developed do not duplicate the operation of the human brain [5]. The intelligence of a neural network emerges from the collective behavior of its neurons, each of which performs only a very limited operation. Even though each individual neuron works slowly, the neurons can still quickly find a solution by working in parallel; this fact helps explain why humans can recognize a visual scene faster than a digital computer.
Training a neural network can be accomplished in two ways: the supervised learning method and the unsupervised learning method.
In supervised learning, the model defines the effect one set of observations, called inputs, has on another set of observations, called outputs. In other words, the inputs are assumed to be at the beginning and the outputs at the end of the causal chain; the model can include mediating variables between the inputs and the outputs.
In unsupervised learning, all the observations are assumed to be caused by latent variables, i.e. the observations are assumed to be at the end of the causal chain. With unsupervised learning it is possible to learn larger and more complex models than with supervised learning, because in supervised learning one is only trying to find the connection between two sets of observations. The back propagation method is the basis for training a supervised neural network; its output is a real value that lies between 0 and 1, determined by the sigmoid function.

III. MLP NEURAL NETWORKS
The Multi Layer Perceptron (MLP) is a type of artificial neural network used for supervised learning problems such as pattern classification. A feed-forward network has a layered structure: each layer consists of units which receive their input from units in the layer directly below and send their output to units in the layer directly above; there are no connections within a layer. The inputs are fed into the first layer of hidden units, and no processing takes place in the input units. The training output values are vectors of length equal to the number of classes; after training, the network responds to new patterns [8]. The network consists of an input layer, a number of hidden layers and an output layer, as shown in Fig. 1.
The activation of the hidden units in layer Nh,1 is a function f of their weighted inputs plus a bias. The output of these hidden units is distributed over the next layer of Nh,2 hidden units, and so on until the last layer of hidden units, whose outputs are fed into a layer of No output units [6]. The output of each hidden neuron is weighted and passed to the output layer, so the outputs of the network are sums of the weighted hidden-layer neurons. The design of an MLP requires several decisions, including the number of hidden units in the hidden layer, the values of the prototypes, the functions used at the hidden units and the weights applied between the hidden layer and the output layer [7]. The performance of an MLP network depends on the number of inputs, the shape of the sigmoid function at the hidden units and the method used for determining the network weights. MLP networks are trained by selecting the weights randomly from the training data.

Fig. 1 Multi-layer network with l hidden layers

A. Working Principle
Consider the MLP neural network with one hidden layer shown in Fig. 2, with I input nodes, J hidden nodes and K output nodes. The I-dimensional input z is passed directly to the hidden layer. Each of the J neurons in the hidden layer applies the unipolar sigmoid activation function defined in Equation (1):

f1(net) = 1 / (1 + e^(-λ·net))    (1)

where λ > 0, net is the summation of the weighted inputs of a node, and J is the number of nodes in the hidden layer. Each of the K neurons in the output layer applies the generalized sigmoid activation function defined in Equation (2):

f2(net_k) = e^(net_k) / Σ_{m=1..K} e^(net_m)    (2)

where K is the number of nodes in the output layer. The generalized sigmoid function is the general case of the sigmoid function, with the ability to specify the steepness of the function as well as an offset that should be taken into consideration. Its use introduces additional flexibility into the MLP model: since the response of each output neuron is tempered by the responses of all the output neurons, the competition actually fosters cooperation among the output neurons [15].

Fig. 2 Multi-layer network with a single hidden layer

When the MLP neural network is used for classification, the hidden layer performs clustering while the output layer performs classification. The hidden units apply a non-linear sigmoid activation function to the input patterns, and the output layer linearly combines all the outputs of the hidden layer. Each output node then gives an output value which represents the probability that the input pattern falls under that class. Back-propagation can be applied to networks with any number of layers; in this paper a feed-forward neural network with a single layer of hidden units and a sigmoid activation function at the units is used.
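To make the working principle concrete, the following minimal sketch (not taken from the paper) computes the forward pass of a single-hidden-layer MLP. It assumes NumPy, reads the generalized sigmoid in (2) as a softmax-style normalisation over the K output nodes, and uses placeholder layer sizes; the function and variable names are illustrative only.

```python
import numpy as np

def unipolar_sigmoid(net, lam=1.0):
    """Equation (1): f1(net) = 1 / (1 + exp(-lambda * net)), lambda > 0."""
    return 1.0 / (1.0 + np.exp(-lam * net))

def generalized_sigmoid(net):
    """Equation (2), read here as a softmax-style normalisation over the outputs."""
    e = np.exp(net - np.max(net))          # shift for numerical stability
    return e / e.sum()

def mlp_forward(z, V, W, lam=1.0):
    """Forward pass of an I-J-K MLP.
    z : input vector of length I
    V : hidden-layer weight matrix, shape (J, I)
    W : output-layer weight matrix, shape (K, J)
    """
    y = unipolar_sigmoid(V @ z, lam)       # hidden-layer outputs, length J
    o = generalized_sigmoid(W @ y)         # output-layer responses, length K
    return y, o

# Example with assumed sizes: I = 11 LPC features, J = 8 hidden nodes, K = 3 speakers
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(8, 11))
W = rng.normal(scale=0.1, size=(3, 8))
_, outputs = mlp_forward(rng.normal(size=11), V, W)
print(outputs)                              # three class responses summing to 1
```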
IV. BACKPROPAGATION NEURAL NETWORK
An MLP neural network trained with the back propagation training algorithm is known as a Back Propagation (BP) neural network. The error surface of a complex network is full of hills and valleys. Because of the gradient descent, the network can get trapped in a local minimum when there is a much deeper minimum nearby. Probabilistic methods can help to avoid this trap, but they tend to be slow. Another suggested possibility is to increase the number of hidden units. Although this will work because of the higher dimensionality of the error space, which makes the chance of getting trapped smaller, there appears to be an upper limit on the number of hidden units beyond which the system again becomes trapped in local minima [8]. Back propagation with tunneling is therefore used to train the designed BP neural network: it replaces the gradient-descent rule for MLP learning and can find the global minimum from an arbitrary initial choice in the weight space. The algorithm consists of two phases. The first phase is a local search that implements BP learning. The second phase implements dynamic tunneling in the weight space, avoiding the local trap and thereby generating the point of next descent. Repeating the two phases alternately forms a training procedure that leads to the global minimum. This algorithm is computationally efficient.

A. Back Propagation Training
The first phase of the back propagation tunneling training algorithm is the back propagation training algorithm. A two-layer feed-forward network by itself does not offer a solution to the problem of how to adjust the weights from the input to the hidden units. The solution is that the errors of the units of the hidden layer are determined by back-propagating the errors of the units of the output layer; the method is therefore often called the back-propagation learning rule. Back-propagation can also be considered a generalization of the delta rule for non-linear activation functions and multi-layer networks. The error measure E is defined as the total quadratic error for pattern p at the output units. When a learning pattern is clamped, the activation values are propagated to the output units and the actual network output is compared with the desired output values; we usually end up with an error in each of the output units.
Let I be the number of input nodes, J the number of hidden nodes and K the number of output nodes of the MLP neural network. Let V be the weight matrix of the hidden layer and W the weight matrix of the output layer; the size of W is K × J and the size of V is J × I.

B. The steps of the training cycle
1) Applying the feature vectors one by one to the input layer, the output y of the hidden layer is computed as

y_j = f1(v_j · z),  j = 1, 2, …, J    (3)

where f1(.) is the unipolar sigmoid function defined in (1). The output o of the output layer is computed as

o_k = f2(w_k · y),  k = 1, 2, …, K    (4)

where f2(.) is the generalized sigmoid function defined in (2).
2) The error value is computed as

E = ½ Σ_{k=1..K} (d_k − o_k)²    (5)

where d_k is the desired output of node k.
3) The error signal vectors δ_o and δ_y of both layers are computed; the dimension of δ_o is K × 1 and the dimension of δ_y is J × 1. The error signal terms of the output layer are

δ_ok = (d_k − o_k)(1 − o_k) o_k,  k = 1, 2, …, K    (6)

and the error signal terms of the hidden layer are

δ_yj = y_j (1 − y_j) Σ_{k=1..K} δ_ok w_kj,  j = 1, 2, …, J    (7)

4) The output layer weights are adjusted as

w_kj = w_kj + η δ_ok y_j,  k = 1, 2, …, K,  j = 1, 2, …, J    (8)

5) The hidden layer weights are adjusted as

v_ji = v_ji + η δ_yj z_i,  j = 1, 2, …, J,  i = 1, 2, …, I    (9)

where η is the learning rate.
6) Steps 1) to 5) are repeated for all the feature vectors.
7) The training cycle is repeated for 1000 epochs.
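A minimal sketch of one training cycle, following steps 1) to 7) above, is given below. It is an illustration rather than the authors' implementation: the variable names, the one-hot target vectors d, the learning rate value and the reuse of the sigmoid and softmax forms from the previous sketch are assumptions, and the output-layer error signal follows the (d − o)(1 − o)o form of (6) rather than the exact gradient of the generalized sigmoid.

```python
import numpy as np

def train_epoch(Z, D, V, W, eta=0.1, lam=1.0):
    """One training cycle (steps 1-6): loop over all feature vectors once.
    Z : training features, shape (P, I);  D : one-hot targets, shape (P, K)
    V : hidden weights (J, I);  W : output weights (K, J)  (updated in place)
    """
    total_error = 0.0
    for z, d in zip(Z, D):
        y = 1.0 / (1.0 + np.exp(-lam * (V @ z)))       # (3) hidden-layer outputs
        net_o = W @ y
        o = np.exp(net_o - net_o.max()); o /= o.sum()  # (4) output-layer responses
        total_error += 0.5 * np.sum((d - o) ** 2)      # (5) quadratic error
        delta_o = (d - o) * (1.0 - o) * o              # (6) output error signals
        delta_y = y * (1.0 - y) * (W.T @ delta_o)      # (7) hidden error signals
        W += eta * np.outer(delta_o, y)                # (8) output-layer update
        V += eta * np.outer(delta_y, z)                # (9) hidden-layer update
    return total_error

# Step 7: repeat the cycle for 1000 epochs, e.g.
# for epoch in range(1000):
#     err = train_epoch(Z, D, V, W)
```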
V. TUNNELING ALGORITHM
The second phase of the back propagation tunneling training algorithm is dynamic tunneling. A random point W in the weight space is selected. For a new point W + ε, where ε is a small perturbation applied to W, if E(W + ε) ≤ E(W) then BP is applied until a local minimum is found; otherwise, the tunneling technique takes place until a point in a lower basin is found, as shown in Fig. 3.

Fig. 3 Error surface

The algorithm automatically enters the two phases alternately, and the weights are modified according to their respective update rules. The tunneling is implemented by solving the differential equation (10):

dw_ji/dt = ρ (w_ji − w_ji*)^(1/3)    (10)

where ρ is the learning rate of the tunneling phase and w_ji* is the value of the weight w_ji at the last local minimum found by BP. The tunneling starts from the perturbed point

w_ji(t = 0) = w_ji* + ε_ji    (11)

The equation is integrated for a fixed amount of time with a small time step Δt. After every Δt, E(W) is computed with the new value of w_ji, keeping the remaining components of W unchanged. Tunneling comes to a halt when E(W) ≤ E(W*), and the next gradient descent is initiated. If this condition of descent is not satisfied, the process is repeated with all the components of W* one by one until the condition of descent is reached. If the condition is not satisfied for any component, then the last local minimum is taken as the global minimum [21][22].
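A sketch of the dynamic tunneling phase in (10) and (11) is shown below, tunneling one weight component at a time while the remaining components of W are held fixed, as described above. This is an illustrative reading, not the authors' code: the perturbation size, time step, number of integration steps and the error function E are placeholders rather than values from the paper.

```python
import numpy as np

def tunnel_component(E, w_star, idx, rho=1.0, eps=1e-2, dt=1e-2, steps=500):
    """Dynamic tunneling on one weight component, following (10) and (11).
    E      : error function taking the full weight vector
    w_star : weight vector at the last local minimum found by BP
    idx    : index of the component being tunnelled; all others stay fixed
    Returns (new_weights, True) if a lower basin is found, else (w_star, False).
    """
    w = w_star.copy()
    w[idx] = w_star[idx] + eps                 # (11) start from a perturbed point
    e_star = E(w_star)
    for _ in range(steps):
        # (10) dw/dt = rho * (w - w*)^(1/3), integrated with a small time step
        diff = w[idx] - w_star[idx]
        w[idx] += dt * rho * np.sign(diff) * abs(diff) ** (1.0 / 3.0)
        if E(w) <= e_star:                     # lower basin found: resume BP here
            return w, True
    return w_star, False                       # descent condition never met

# Usage idea: after BP converges to w_star, call tunnel_component for each index;
# if no component yields E(w) <= E(w_star), take w_star as the global minimum.
```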
VI. FEATURE EXTRACTION USING LINEAR PREDICTIVE CODING (LPC)
A. Introduction
Linear Predictive Coding (LPC) is one of the most powerful speech analysis techniques and one of the most useful methods for encoding good quality speech at a low bit rate. It provides extremely accurate estimates of speech parameters and is relatively efficient to compute [4].
B. Basic Principles
LPC starts from the assumption that the speech signal is produced by a buzzer at the end of a tube. The glottis (the space between the vocal cords) produces the buzz, which is characterized by its intensity (loudness) and frequency (pitch). The vocal tract (the throat and mouth) forms the tube, which is characterized by its resonances, called formants. LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the remaining signal is called the residue.
The numbers which describe the formants and the residue can be stored or transmitted elsewhere. LPC synthesizes the speech signal by reversing the process: the residue is used to create a source signal, the formants are used to create a filter (which represents the tube), and the source is run through the filter, resulting in speech. Because speech signals vary with time, this process is carried out on short chunks of the speech signal, called frames; usually 30 to 50 frames per second give intelligible speech with good compression [20].
C. Windowing
For short-term analysis the signal must be zero outside a defined range; this is achieved by multiplying the signal with a window. Normally a window width of 20-30 ms is chosen, and the window shift is usually 10 ms.
D. Window Shapes
Common window shapes are the rectangular window, w_n = 1, and the Hamming window, w_n = 0.54 − 0.46 cos(2πn/(N − 1)); other common windows are the Gauss, Hann and Blackman windows. In this work a rectangular window, w_n = 1, with a window width of 20 ms is used.
LPC uses the autocorrelation method of autoregressive (AR) modeling to find the filter coefficients. The generated filter might not model the process exactly, even if the data sequence is truly an AR process of the correct order, because the autocorrelation method implicitly windows the data; that is, it assumes that signal samples beyond the length of x are zero.
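As an illustration of the feature extraction described in this section, the sketch below frames a signal with a 20 ms rectangular window and computes autocorrelation-method LPC coefficients with the Levinson-Durbin recursion, yielding 11 coefficients (including the leading 1) for order 10. The sampling rate, the 10 ms frame shift and the averaging of per-frame vectors into one feature vector are assumptions, not details given in the paper.

```python
import numpy as np

def lpc_autocorr(frame, order=10):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.
    Returns [1, a1, ..., a_order]; samples beyond the frame are treated as zero."""
    n = len(frame)
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12                               # guard against silent frames
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err   # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]   # update prediction coefficients
        err *= 1.0 - k * k
    return a

def lpc_features(signal, fs=8000, frame_ms=20, hop_ms=10, order=10):
    """Frame the signal with a 20 ms rectangular window (w_n = 1) and 10 ms shift,
    then average the per-frame LPC vectors into one 11-element feature vector.
    The averaging step is an assumption; the paper only states that 11 LPC
    features are extracted per speech signal."""
    flen, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    x = np.asarray(signal, dtype=float)
    frames = [x[i:i + flen] for i in range(0, len(x) - flen + 1, hop)]
    return np.mean([lpc_autocorr(f, order) for f in frames], axis=0)
```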
VII. EXPERIMENTAL RESULTS
This work implements the tunneling algorithm to train a back propagation neural network in order to reduce the computational effort required for training. The tests are carried out on the speaker recognition problem. For each speech signal, 11 features are extracted using the Linear Predictive Coding (LPC) technique, and the feature vector is classified into one of 3 speakers, as shown in Fig. 4. The back propagation network is trained with five utterances for each word and tested with five more utterances which are different from the training utterances. The network was designed to recognize the 3 speakers; 11 features are extracted from each signal using standard LPC analysis. The speech signals of the three speakers for the word "hello" are shown in Fig. 5, Fig. 6 and Fig. 7 respectively. The features obtained for the three speakers, for four utterances each, are tabulated in Table 1.

speaker   | Coeff 1 | Coeff 2 | Coeff 3 | Coeff 4 | Coeff 5 | Coeff 6 | Coeff 7 | Coeff 8 | Coeff 9  | Coeff 10 | Coeff 11
Speaker 1 | 1.000   | -2.801  | 2.619   | -0.1713 | -1.3305 | 0.2278  | 1.4217  | -1.4536 | 0.54563  | -0.03196 | -0.02005
Speaker 1 | 1.9501  | -1.851  | 3.569   | 0.77883 | -0.3803 | 1.178   | 2.3718  | -0.5034 | 1.4958   | 0.91817  | 0.93008
Speaker 1 | 1.2311  | -2.570  | 2.850   | 0.05983 | -1.0994 | 0.45901 | 1.6528  | -1.2225 | 0.77677  | 0.19918  | 0.21109
Speaker 1 | 1.6068  | -2.194  | 3.2262  | 0.43554 | -0.7236 | 0.83471 | 2.0285  | -0.8467 | 1.1525   | 0.57488  | 0.58679
Speaker 2 | 1.000   | -2.389  | 1.9524  | -0.7100 | 0.21869 | 0.83471 | 0.02764 | 0.02490 | 0.056714 | -0.14382 | 0.070365
Speaker 2 | 1.1897  | -2.199  | 2.1421  | -0.5203 | 0.40834 | -0.1026 | 0.2173  | 0.21456 | 0.24637  | 0.045834 | 0.26002
Speaker 2 | 1.1934  | -2.196  | 2.1458  | -0.5165 | 0.41212 | 0.08703 | 0.22107 | 0.21834 | 0.25015  | 0.049611 | 0.2638
Speaker 2 | 1.6822  | -1.707  | 2.6346  | -0.0278 | 0.90091 | 0.09081 | 0.70987 | 0.70713 | 0.73894  | 0.5384   | 0.75259
Speaker 2 | 1.000   | -2.389  | 1.9524  | -0.7100 | 0.21869 | 0.5796  | 0.21285 | 0.25396 | 0.091236 | -0.37531 | 0.17657
Speaker 3 | 1.000   | -2.167  | 1.3171  | -0.1625 | 0.54389 | -0.8750 | 0.27763 | 0.31874 | 0.15602  | -0.31053 | 0.24135
Speaker 3 | 1.0648  | -2.102  | 1.3819  | -0.0977 | 0.60867 | -0.8103 | 1.2012  | 1.2423  | 1.0796   | 0.61302  | 1.1649
Speaker 3 | 1.9883  | -1.178  | 2.3054  | 0.8258  | 1.5322  | 0.11324 | 0.79564 | 0.83675 | 0.67403  | 0.20748  | 0.75936
Speaker 3 | 1.5828  | 1.5843  | 1.8999  | 0.42026 | 1.1267  | -0.2923 | 0.795   | 0.836   | 0.764    | 0.209    | 0.7541

Table 1 Linear predictive coding (LPC) coefficients for the three speakers

The designed BP neural network consists of 11 input nodes, a variable number of hidden nodes (from 1 to 15) and 3 output nodes. The experimental results are verified by plotting the percentage of correct classification on the Y axis against the number of hidden nodes on the X axis, as shown in Fig. 8; it was observed that as the number of hidden nodes is varied between 1 and 15, the percentage of correct classification increases.

Fig. 4 Block diagram of the speaker identification system
Fig. 5 Speaker 1 speech signal for the word "hello"
Fig. 6 Speaker 2 speech signal for the word "hello"
Fig. 7 Speaker 3 speech signal for the word "hello"
Fig. 8 Percentage of correct classification versus number of hidden nodes

VIII. CONCLUSION
Training of the back propagation neural network with the tunneling algorithm is proposed for the identification of different speakers. The proposed method is tested on three speakers with ten utterances each. It is found that this method requires less time for training the BP neural network and has a good success rate in identifying the speakers. It was also observed that the percentage of correct classification increases with the number of hidden nodes.

REFERENCES
[1] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall India.
[2] Sadaoki Furui, Digital Speech Processing, Synthesis, and Recognition.
[3] M. R. Schroeder, Speech and Speaker Recognition.
[4] John L. Ostrander and Timothy D., "Speech Recognition Using LPC Analysis".
[5] Tom Mitchell, "Artificial Neural Networks," in Machine Learning, 1st ed., McGraw-Hill, 1997, pp. 95-111.
[6] Peter Dreisiger, Cara MacNish and Wei Liu, "Estimating Conceptual Similarities Using Distributed Representations and Extended Backpropagation," (IJACSA) International Journal of Advanced Computer Science and Applications, vol. 2, no. 4, 2011.
[7] O. Batsaikhan and Y. P. Singh, "Mongolian Character Recognition using Multilayer Perceptron," Proceedings of the 9th International Conference on Neural Information Processing, vol. 2, 2002.
[8] D. Y. Lee, "Handwritten digit recognition using K nearest neighbour, radial basis function and backpropagation neural networks," Neural Computation, vol. 3, pp. 440-449.
[9] A. M. Numan-Al-Mobin, Mobarakol Islam, Kaustubh Dhar, Tajul Islam, Md. Rezwan and M. Hossain, "Backpropagation with Vector Chaotic Learning Rate".
[10] Marine Campedel-Oudot, Olivier Cappé and Eric Moulines, "Estimation of the Spectral Envelope of Voiced Sounds Using a Penalized Likelihood Approach," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5, July 2001.
[11] Jonas Samuelsson and Per Hedelin, "Recursive Coding of Spectrum Parameters," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5, July 2001.
[12] Bishnu S. Atal, "The History of Linear Prediction".
[13] Roy C. Snell and Fausto Milinazzo, "Formant Location from LPC Analysis Data," IEEE Transactions on Speech and Audio Processing, vol. 1, no. 2, April 1993.
[14] Madre and G. Baghious, "Linear Predictive Speech Coding Using Fermat Number Transform".
[15] Shih-Chi Huang and Yih-Fang Huang, "Learning Algorithms for Perceptrons Using Back-Propagation with Selective Updates".
[16] Fu-Chuang Chen, "Back-Propagation Neural Networks for Nonlinear Self-Tuning Adaptive Control".
[17] John H. L. Hansen and Brian D. Womack, "Feature Analysis and Neural Network-Based Classification of Speech Under Stress," IEEE Transactions on Speech and Audio Processing, vol. 4, no. 4, July 1996.
[18] Sharada C. Sajjan and Vijaya C., "Speech Recognition Using Hidden Markov Models," World Journal of Science and Technology, vol. 1, no. 12, pp. 75-78, 2011, ISSN 2231-2587.
[19] Ben Pinkowski, "LPC Spectral Moments for Clustering Acoustic Transients," IEEE Transactions on Speech and Audio Processing, vol. 1, no. 3, July 1993.
[20] Aki Härmä, "Linear Predictive Coding with Modified Filter Structures," IEEE Transactions on Speech and Audio Processing.
[21] Chowdhury and Singh, "Use of Dynamic Tunneling with Backpropagation in Training Feedforward Neural Networks".
[22] T. Kathirvalavakumar and P. Thangavel, "A Modified Backpropagation Training Algorithm for Feedforward Neural Networks".