Speaker Identification Using Wavelet and Feed Forward Neural Network

ADZNAN B. J., M. A. ALMASHRGY, ELSADIG AHMED, ABD RAHMAN RAMLI
Department of Computer and Communication Systems Engineering, Faculty of Engineering
University Putra Malaysia, 43400 UPM, Serdang, Selangor DE
MALAYSIA

Abstract: - This paper presents a Wavelet Packet Transform (WPT)-based feature extraction technique. The WPT is used to decompose speech signals into sub-bands similar to the Mel scale, and Mel Frequency Cepstral Coefficients (MFCC) are computed for each band as features of the speech signal, which are used to differentiate between speakers. These features are then passed to a classifier; a Feed Forward Neural Network (FFNN) is proposed for the classification task. Instead of using one NN for all speakers, we provide a separate NN for each speaker. The system is applied to text-dependent speaker recognition in a noisy environment. Experiments conducted on ten speakers gave considerable results, yielding a promising average accuracy of about 89.9%.

Key-Words: - Wavelet Packet Transform (WPT), Feed Forward Neural Network (FFNN), Text-Dependent Speaker Recognition, Speaker Identification, False Identification Error (FIE), Discrete Wavelet Transform (DWT)

1 Introduction
Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information in the speech wave. The goal of a speaker recognition system is to extract, characterize and recognize the information in the speech signal that conveys the speaker's identity [1].

The problem of speaker recognition can be divided into two sub-problems: speaker identification and speaker verification. Speaker identification is the task of determining who is speaking from a set of unknown voices or speakers; the unknown person makes no identity claim. Speaker verification is the task of determining whether a person is who he or she claims to be (a yes/no decision) [1]. The error that can occur in speaker identification is the false identification of a speaker, while the errors in speaker verification can be classified into two types: (1) false rejection, where a true speaker is rejected and considered an impostor, and (2) false acceptance, where a false speaker is accepted as a true one [2].

Speaker recognition can also be divided into text-dependent and text-independent methods. In a text-dependent method, a speaker is required to utter a predetermined set of words or sentences, and the features are extracted from that utterance. In a text-independent method, there is no predetermined set of words or sentences, and the speakers may not even be aware that they are being tested [3].

All speaker recognition systems consist of two parts: the first is feature extraction, and the second is classification, or modeling of the speaker based on the extracted features. Conventional speaker recognition is performed by recognizing an extracted set of acoustic feature vectors calculated over the whole frequency band of the input speech (the full-band technique). Many methods have been used for extracting features from speech signals, for instance frequency band analysis, formant frequencies, pitch contours, coarticulation, short-term processing, linear prediction, and the mel-warped cepstrum; for more details refer to [3], [4].

Wavelet and time-frequency methods have been shown over the last two decades to be effective signal processing techniques for a variety of problems.
In particular, wavelets have been successfully applied to de-noising tasks and as robust features [5]. There has been recent interest in using wavelets in speech recognition: [6, 7] apply wavelets to speech recognition, using the DWT to decompose the speech signal. Instead of dealing with the DWT, this paper proposes to split the signal into sub-bands similar to the Mel scale using the WPT. The advantage of using the WPT is that it can segment the frequency axis and has uniform translation in time [8].

A number of modeling techniques have been applied to speaker recognition, including dynamic time warping, nearest neighbor, hidden Markov modeling, vector quantization, and neural networks. In this paper we propose to use a Feed Forward Neural Network (FFNN). Most previous work concentrated on phoneme recognition using 32 ms framing; here, isolated-word speaker recognition is applied to the whole signal, i.e., no framing is performed on the speech signal.

2 Wavelet Transform
Wavelet analysis is a relatively new mathematical discipline. Wavelets have the ability to analyze different parts of a signal at different scales, and the wavelet transform provides a variable time-frequency representation of signals [9]. The Fourier transform gives the spectral content of a signal, but no information about where in time those spectral components appear; it is therefore not suitable for analyzing non-stationary signals such as speech. In the STFT, the signal is divided into small segments that can be assumed to be stationary. In this method, narrow windows give good time resolution but poor frequency resolution, while wide windows give good frequency resolution but poor time resolution. The wavelet transform solves this resolution dilemma to a certain extent: it is designed to give good time resolution and poor frequency resolution at high frequencies, and good frequency resolution and poor time resolution at low frequencies.

Figure 1 demonstrates a three-level wavelet packet decomposition of a discrete input signal using appropriate H and G low-pass and high-pass filters, respectively. Each H and G filter output, followed by down-sampling by 2, is a wavelet packet. The wavelet packets at a given decomposition level have the same scale but different time translations [11].

Fig. 1 Full decomposition wavelet packet tree

2.1 Wavelet Packet Transform
The DWT is really a subset of the Wavelet Packet Transform. Wavelet packet decomposition [10] is a generalization of the discrete wavelet transform (DWT) and of multi-resolution analysis. The wavelet packet decomposition (WPD) of a signal can be viewed as a step-by-step transformation of the signal from the time domain to the frequency domain. The top level of the WPD tree is the time representation of the signal; as each level of the tree is traversed, time resolution is traded for frequency resolution, and the bottom level of a fully decomposed tree is a frequency representation of the signal.
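To make this concrete, the following is a minimal sketch of a full three-level wavelet packet decomposition, assuming the PyWavelets library; 'db4' is an illustrative choice, since the paper uses the Daubechies family (see Section 3.3.1) without specifying the filter order.

```python
# A minimal sketch of a full three-level wavelet packet decomposition,
# assuming the PyWavelets library. 'db4' is an illustrative choice: the
# paper uses the Daubechies family but does not give the filter order.
import numpy as np
import pywt

fs = 16000                            # sampling rate used in the experiments
x = np.random.randn(fs)               # placeholder for one recorded utterance

wp = pywt.WaveletPacket(data=x, wavelet='db4', mode='symmetric', maxlevel=3)
for node in wp.get_level(3, order='freq'):   # 8 terminal sub-bands, low to high
    print(node.path, len(node.data))         # each node holds one band's coefficients
```

In the method of this paper, selected low-frequency nodes of such a tree are decomposed further rather than fully, so that the resulting sub-bands approximate the Mel scale, as described in Section 3.3.1.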
3 Materials and Methodology

3.1 Experiment Setup
All data sets are recorded with a microphone in a laboratory environment using a 16 kHz sampling rate. The data consist of 10 speakers, 5 of whom are male and the other 5 female. The speakers speak the English word "ENTER". For each speaker, 5 signals are used for training; for testing the system, 15-25 other recorded signals are used.

3.2 End Point Detection
Due to the noise ('silence') that occurs at the start and the end of the speech signal, the boundaries (start and end) of the speech signal must be detected and the silence parts removed. End Point Detection (EPD) is used to detect the boundaries. Figure 2 shows the speech signal before applying the EPD algorithm and the speech signal after removing the silence.

Fig. 2 End Point Detection (speech signal before and after silence removal)

3.3 Feature Extraction
Figure 3 shows a block diagram of the feature extraction task. First, the WPT is used to decompose the speech signal into sub-bands. After that, some bands are chosen and the log energy of each chosen band is taken. Finally, the DCT of the chosen bands is taken.

Fig. 3 Block diagram of the feature extraction stage (speech signal -> WPT -> log energy of bands -> DCT -> 24 features)

The WPT is used for partitioning the frequency axis of the speech signal similarly to the Mel scale. The signal is split into sub-bands according to [5].

3.3.1 Decomposing the Speech Signal
The next step is to analyze the signal using the wavelet packet transform. There are many wavelet families that can be used; the Daubechies (db) family is chosen for this application. The decomposition is applied to the signal as follows:

First, a full three-level decomposition is carried out. Figure 4 shows the complete three-level decomposition tree of the signal, with the decomposition tree on the left side and the signal at the appropriate node (node 7 in the figure) on the right side.

Second, the lowest band is further decomposed by again applying a full three-level WP decomposition. The lowest band is node number 7, which can also be represented as node (3, 0). This decomposition produces 8 sub-bands.

Third, the frequency band of node 8 is further decomposed by applying a two-level WP decomposition, giving four sub-bands.

Fourth, the two lowest bands of the previous decomposition are further decomposed with a two-level WP decomposition. The number of bands produced by the third and fourth steps is therefore six.

Fifth, a two-level WP decomposition is applied to the terminal node 9 band, producing four bands.

Sixth, band 10 is decomposed once, so two bands come out.

Finally, the four remaining bands are not further decomposed. The final number of bands produced by the WP is therefore 24.

Fig. 4 Tree of the decomposed signal

From the previous steps, 24 bands are generated. The first 20 bands are taken for further analysis, which is described in the following sections.

3.3.2 Log Energy Band Calculation
The energy of each band is calculated by taking the sum of the squared values of the coefficients of that band:

E_i = \sum_{n} c(n)^2    (1)

where n is the number of coefficients in the band and c(n) are the wavelet coefficients. The log of each band energy is then taken.

3.3.3 Computing the Cosine Transform
The discrete cosine transform (DCT) is taken over the chosen bands. The following equation describes the extracted features [4]:

coef_n = \sum_{i=1}^{L} \log E_i \cos\left( \frac{\pi n (i - 0.5)}{L} \right), \quad n = 1, \ldots, L    (2)

where L is the number of chosen bands (here L = 20). Now the features for each speaker are calculated.
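As a concrete illustration, eqs. (1) and (2) can be sketched in a few lines of Python/NumPy; this is a minimal sketch, assuming `bands` holds the sub-band coefficient arrays produced by the decomposition in Section 3.3.1, ordered from low to high frequency.

```python
# A minimal sketch of eqs. (1) and (2): per-band log energies followed by a
# cosine transform across bands. `bands` is assumed to be a list of sub-band
# coefficient arrays (e.g., node data from the WP tree), low to high frequency.
import numpy as np

def extract_features(bands, L=20):
    # Eq. (1): energy of each band is the sum of its squared coefficients;
    # the log of each band energy is then taken.
    log_E = np.array([np.log(np.sum(b ** 2)) for b in bands[:L]])
    i = np.arange(1, L + 1)
    # Eq. (2): cosine transform of the log energies over the L chosen bands.
    return np.array([np.sum(log_E * np.cos(np.pi * n * (i - 0.5) / L))
                     for n in range(1, L + 1)])
```

With the 20 chosen bands, this returns the 20 coefficients that serve as the network inputs in the classification stage.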
This completes the feature extraction part of the recognition task; what remains is to classify who is speaking. The following section describes the method applied in this paper.

3.4 Classification
Our approach in the classification paradigm is to apply an FFNN. Instead of using one NN for all speakers, one NN is created for each speaker. Each NN is designed to make a single decision: whether or not the test utterance comes from the target speaker of that model. While it is feasible for one network to decide between multiple targets, that approach is not adopted here, in order to keep the complexity manageable: if one of the target speakers were changed or a new target added, such a network would have to be retrained on all speakers, i.e., a new network would need to be developed. With one NN per speaker, adding a new speaker to the group requires no retraining of the existing NNs; a new NN is simply built for the newcomer. Figure 5 shows a simple diagram of the proposed arrangement.

Fig. 5 A simple diagram of the proposed NN arrangement (the inputs feed Net1, Net2, Net3, ..., NetN, whose outputs form the decision)

Each NN has 20 inputs (the feature coefficients), two hidden layers, and one output. The first hidden layer consists of a number of neurons and the second consists of three neurons. Figure 6 shows the structure of the NN used in this paper.

Fig. 6 NN structure for classification (inputs 1-20, two hidden layers, one output)

For each speaker, five speech signals are used to train its network, and the output target is assigned to 0.5. After training, 15-25 speech signals per speaker are used to test the system. In testing we use a certain threshold on the deviation, which is calculated as follows:

d = |Y - 0.5|    (3)

where Y is the output of the network and d is the deviation from the target output (0.5). We check the deviation for all speakers' networks. If the deviation is greater than the threshold for every network, the person trying to access the system is considered an impostor; otherwise, the speaker trying to access the system is taken to be the one whose network has the smallest deviation.
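The per-speaker scheme and the decision rule of eq. (3) can be sketched as follows. This is a minimal illustration assuming scikit-learn's MLPRegressor as a stand-in for the paper's FFNN (the training algorithm is not specified); the first hidden layer size and the threshold value are assumptions, since the paper gives neither.

```python
# A minimal sketch of the per-speaker network scheme and the deviation rule of
# eq. (3). scikit-learn's MLPRegressor stands in for the paper's FFNN; the
# first hidden layer size (10) and the threshold value are assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor

TARGET = 0.5       # training target for every network, as in the paper
THRESHOLD = 0.1    # hypothetical deviation threshold; the paper gives no value

def train_speaker_nets(train_sets):
    """train_sets maps speaker id -> feature array of shape (n_signals, 20)."""
    nets = {}
    for speaker, feats in train_sets.items():
        net = MLPRegressor(hidden_layer_sizes=(10, 3),  # second hidden layer: 3 neurons
                           max_iter=2000, random_state=0)
        net.fit(feats, np.full(len(feats), TARGET))     # all targets are 0.5
        nets[speaker] = net
    return nets

def identify(nets, feature_vec):
    """Return the identified speaker, or None for an impostor."""
    d = {s: abs(net.predict(feature_vec[None, :])[0] - TARGET)  # eq. (3)
         for s, net in nets.items()}
    best = min(d, key=d.get)
    return best if d[best] <= THRESHOLD else None  # smallest deviation wins
```

Keeping one small network per speaker mirrors the argument above: enrolling a new speaker requires training only one new network, leaving the existing ones untouched.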
4 Experimental Results
Our experiments were conducted on 10 speakers. For each speaker, 20 signals of the English word "ENTER" were recorded; five of these signals are used for training the system and the other 15-25 signals for testing it. The False Identification Error (FIE) is calculated for each speaker and presented in Table 1.

Table 1 shows the false identification error calculated for the ten speakers using our approach. The minimum error obtained is 0%, for speaker #2, while the maximum error, 16%, is obtained for speaker #5; the errors for the other speakers lie between 0% and 16%. From these results, the average identification accuracy over the 10 speakers is 89.9%. In [7], which used the DWT to extract features from the speech signal and the Adaptive Resonance Theory (ART) family of neural networks to classify them, the average accuracy was 81%.

Table 1 False identification errors

Speaker number    % of false identification error
Speaker 1         6.66%
Speaker 2         0%
Speaker 3         6.66%
Speaker 4         0%
Speaker 5         16%
Speaker 6         13%
Speaker 7         16%
Speaker 8         10%
Speaker 9         16%
Speaker 10        6.66%

5 Conclusion
In this paper we introduced a feature extraction method using the WPT. The WPT is applied to the whole speech signal, with no framing. This approach gives good results for text-dependent speaker recognition; in some cases it gives a 0% false identification error. If the number of speakers is increased, however, the false identification error may also increase, so this system is better suited to a small number of speakers than to a large one. On the other hand, using the WPT shows an improved result over the DWT.

References
[1] D. A. Reynolds, "An Overview of Automatic Speaker Recognition Technology", Proc. IEEE ICASSP, 2002, pp. 4072-4075.
[2] B. Imperl, "Speaker Recognition Techniques", Laboratory for Digital Signal Processing, Faculty of Electrical Engineering and Computer Science, Smetanova, 2000 Maribor, Slovenia.
[3] S. K. Singh, "Features and Techniques for Speaker Recognition", M.Tech. Credit Seminar Report, Electronic Systems Group, EE Dept., IIT Bombay, Nov. 2003.
[4] J. P. Campbell, "Speaker Recognition: A Tutorial", Proceedings of the IEEE, Vol. 85, No. 9, September 1997, pp. 1437-1462.
[5] R. T. Ogden, "Essential Wavelets for Statistical Applications and Data Analysis", Birkhäuser, 1997.
[6] O. Farooq and S. Datta, "Phoneme Recognition Using Wavelet Based Features".
[7] S. C. Woo, C. P. Lim and R. Osman, "Development of a Speaker Recognition System Using Wavelets and Artificial Neural Networks", Proc. International Symposium on Intelligent Multimedia, Video and Speech Processing, 2001.
[8] O. Farooq and S. Datta, "Mel Filter-like Admissible Wavelet Packet Structure for Speech Recognition", IEEE Signal Processing Letters, Vol. 8, No. 7, July 2001, pp. 196-198.
[9] M. A. Trenas et al., "Architecture for Wavelet Packet Transform with Best Tree Searching", IEEE, 2000.
[10] S. Mallat, "A Wavelet Tour of Signal Processing", Academic Press, 1998.
[11] R. Sarikaya, B. L. Pellom and J. H. L. Hansen, "Wavelet Packet Transform Features with Application to Speaker Identification", Proc. IEEE Nordic Signal Processing Symposium, 1998.