ECE 539 Final Project
Speech Recognition
Christian Schulze
902 227 6316
freeslimer@gmx.de

1. Introduction

1.1. Problem Statement
Design of a speech recognition Artificial Neural Network that is able to distinguish between the digits 0 to 9 and the words "yes" and "no".

1.2. Motivation

1.2.1. Applications
Such a system can be very useful for cellular phones, so that the user only has to say the digits to dial the desired telephone number. It can also be used in elevators to reach the desired floor without having to push any buttons. In general, it can be used almost everywhere that choices require number selections.

1.2.2. Comparison to large systems
Conventional speech recognition systems contain huge libraries of stored patterns for all the words to be recognized, which of course requires a large memory to store the data. In addition, these systems need a fast processor responsible for the non-linear time alignment of the target patterns to the respective words to be recognized and for their comparison to the patterns. After the comparison, an error is calculated which represents the distance between a pattern and a word; the pattern with the smallest error is taken as the recognition result. These resources are too expensive for such a small application. That is why I tried to train a neural network that can do the same.

1.3. Pre-observation of the Artificial Neural Network
I designed a Multilayer Perceptron (MLP) network and used the Back Propagation Algorithm of Professor Hu to solve this classification problem. Corresponding to the number of words to be distinguished at the end, there are 12 different classes, which fixes the number of outputs of the network at 12. The input data are extracted feature values that represent the signal: for each part of the signal I store the frequencies of the first two formants of its spectrum, corresponding to the first two maxima of the spectrum. That is why the process is also called a "Two-Formants-Recognition-System". I implemented the whole algorithm using Matlab.
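To make this layout concrete, the following minimal Matlab sketch shows how one sample could be encoded. It is my own illustration of the described conventions, not code from the project files; the variable names and the one-hot target coding are assumptions. 49 signal parts with two formants each give a 98 by 1 input vector, and the 12 desired values mark the class of the sample, so one stored row has 110 entries.

    % Minimal sketch (assumed variable names): layout of one stored sample.
    % "formants" holds the first two formant frequencies of each of the
    % 49 windowed parts of one recording.
    formants = zeros(49, 2);        % placeholder feature matrix
    x = reshape(formants', 98, 1);  % 98-by-1 input vector (2 formants x 49 parts)

    n_classes = 12;                 % digits 0-9 plus "yes" and "no"
    class_id  = 3;                  % hypothetical class index of this sample
    d = zeros(n_classes, 1);
    d(class_id) = 1;                % desired outputs (one-hot coding assumed)

    sample = [x; d]';               % one 1-by-110 row of the 600-by-110 matrix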
2. Work Performed

2.1. Listing of Steps

2.1.1. Raw data collection
- Recording of 50 samples of every word using "Cool Edit"
  o Sampling frequency: 44100 Hz
  o Channels: mono
  o Resolution: 32 bit (quantization of the amplitude values)
- Storage of each sample in its own text file (stored amplitude values for all time values)
- Summary of all text file names in a vector "text_files", saved in "textf.mat" using the function "textfiles.m"

2.1.2. Extraction of input data
- Passing the text file vector to the function "spectrum_complete.m"
  o Adjustment of all signals to a uniform signal length of 500 ms
  o If the real signal length is smaller than 500 ms, the missing values are filled with zeros: no time stretching or shrinking, no change of the feature data
- Windowing of the signal every 10 ms into parts of 20 ms length using the Hann window (multiplication of the signal with the window) => 49 separate sections for each sample (see the code sketch after section 2.2.1)
- Calculation of the spectrum of these sections using the Discrete Fourier Transformation (the spectrum of the real signal is symmetric about 22050 Hz, and only the first 4 kHz are of interest)
- Implementation of the Cepstral Algorithm for the calculation of the spectrum's contour, so that the spectrum becomes smoothed (see the code sketch after section 2.2.2)
  o Calculation of the Cepstral Coefficients by applying the Inverse Discrete Fourier Transformation after taking the natural logarithm of the spectral coefficients => change into the "quefrency" domain (a kind of time domain)
  o The front values contain the information of the high frequencies, the back values the information of the low frequencies
  o Short pass liftering of the Cepstral Coefficients (keeping only the first 147 coefficients)
  o Calculation of the smoothed spectrum by applying the Discrete Fourier Transformation to the liftered Cepstral Coefficients
- Storage of the frequencies of the first two formants in a vector; in this example these two values are 441 Hz and 804 Hz
  o The vector for every sample becomes 98 by 1
  o This vector is joined with the desired values of the respective class and stored in a matrix
  o The matrix contains 50 of these vectors for every word, so its dimension is 600 by 110
  o Storage of these "input_data" in the file "input.mat"

2.1.3. Input data for the MLP-Network
- Scaling of the input data to [-1, 1] using "scale_data.m"
- Random division of the "input_scaled.mat" data into the files "training.mat" and "testing.mat" in a ratio of 9 to 1 using the file "rand_files.m" (integrated into the Back Propagation Algorithm files "bp_all.m" and "bp_ex.m")
- Feeding these input data into the MLP-Network
- Sizing and optimisation of the network

2.2. Used Algorithms and Formulas

2.2.1. Discrete Fourier Transformation (DFT)
Reference: "Signal Analysis and Recognition", Rüdiger Hoffmann

Actually, I used the Short Time Discrete Fourier Transformation, which corresponds to the convolution of the spectra of the signal and the window function:

X(n,m) = \frac{1}{N} \sum_{k=m-N+1}^{m} x(k)\, h(m-k)\, \exp\!\left(-j \frac{2\pi n}{N}(k-m+N-1)\right), \qquad n = 0, \ldots, N-1

where h(k) represents the k-th value of the window function. The same result is obtained by multiplying the signal with the window function and applying the normal DFT (computed as an FFT) to the product:

X(n) = \frac{1}{N} \sum_{k=0}^{N-1} x(k)\, \exp\!\left(-j \frac{2\pi nk}{N}\right)
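As announced above, here is a minimal Matlab sketch of the windowing and DFT steps, i.e. how the 49 frame spectra of one sample could be computed. It is a reconstruction from the description, not the code of "spectrum_complete.m", and the variable names are my own.

    % Minimal sketch of the framing + DFT step (reconstruction, not the
    % original "spectrum_complete.m").
    fs   = 44100;                    % sampling frequency in Hz
    x    = zeros(round(0.5*fs), 1);  % placeholder: 500 ms signal, zero-padded
    wlen = round(0.020 * fs);        % 20 ms window length (882 samples)
    hop  = round(0.010 * fs);        % 10 ms shift between windows
    w    = hann(wlen);               % Hann window
    nsec = 49;                       % number of sections per sample

    spectra = zeros(wlen, nsec);
    for m = 1:nsec
        seg = x((m-1)*hop + (1:wlen)) .* w;  % windowed 20 ms section
        spectra(:, m) = abs(fft(seg));       % magnitude spectrum of the section
    end

    % Only the first 4 kHz are of interest:
    df      = fs / wlen;                     % frequency resolution (50 Hz per bin)
    spectra = spectra(1:floor(4000/df)+1, :);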
2.2.2. Cepstral Algorithm
Reference: "Signal Analysis and Recognition", Rüdiger Hoffmann

Human speech consists of two different parts. First, the vocal cords produce an origin signal with a characteristic fundamental frequency, which is about 150 Hz for men and about 300 Hz for women. After the production of the origin signal, the air is modulated by the vocal tract, which consists of the pharyngeal, oral, and nasal cavities. This modulation causes the high frequencies in the voice; the maxima of these frequencies represent the different tones.

The Cepstral Algorithm is used to separate these two signal parts. The modulation of the signal corresponds to a convolution of the signal with the system function of the vocal tract, which equals a multiplication of both spectra:

F\{x \ast g\} = F\{x\} \cdot F\{g\} \quad\Rightarrow\quad X(\omega) \cdot G(\omega) = Y(\omega)

where \ast denotes the convolution operator and F the Fourier Transformation. Applying the Inverse Fourier Transformation after taking the logarithm of that expression separates the two parts of the speech, because they concentrate on different ranges in the time domain:

F^{-1}\{\log Y(\omega)\} = F^{-1}\{\log [X(\omega) \cdot G(\omega)]\} = F^{-1}\{\log X(\omega) + \log G(\omega)\} = F^{-1}\{\log X(\omega)\} + F^{-1}\{\log G(\omega)\}

Since real signals are processed, the "Real Cepstrum" can be used, which is defined as

c(\tau) = F^{-1}\{\ln |X(\omega)|^2\} \qquad \text{if } x \text{ is an energy signal.}

The calculation of the Cepstral Coefficients leads into the "quefrency" domain, a kind of logarithmic time domain. The information about the articulation is encoded in the smaller coefficients; that is why a short pass "lifter" with the border coefficient c_border is applied. The coefficient corresponding to the chosen border frequency can be calculated as

c_border ≈ sampling frequency / border frequency

With a border frequency of 300 Hz this gives c_border ≈ 44100 Hz / 300 Hz = 147, so the first 147 coefficients are kept and the DFT is applied to them. After that, the logarithm is cancelled by applying the exponential function, and taking the square root of the expression yields the spectral coefficients of the smoothed spectrum (contour). With this contour it is possible to determine the maxima - the formants - of which the first two are the representative features I used.
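The following Matlab sketch illustrates the cepstral smoothing and the two-formant extraction for one windowed section. Again, this is my reconstruction of the described procedure rather than the original code, and "seg" is an assumed placeholder for one 20 ms section from the previous sketch.

    % Minimal sketch of cepstral smoothing and formant extraction for one
    % windowed 20 ms section (reconstruction of the described procedure).
    fs  = 44100;
    seg = randn(882, 1) .* hann(882);       % placeholder windowed section

    X = fft(seg);
    c = real(ifft(log(abs(X).^2 + eps)));   % real cepstrum; eps avoids log(0)

    c_border = round(fs / 300);             % = 147, short pass lifter border
    c(c_border+1:end) = 0;                  % keep only the first 147 coefficients

    smoothed = sqrt(exp(real(fft(c))));     % smoothed magnitude spectrum (contour)

    % Pick the first two maxima (formants) below 4 kHz; the sketch assumes
    % at least two local maxima exist in that range:
    df  = fs / length(seg);                 % frequency resolution (50 Hz per bin)
    env = smoothed(1:floor(4000/df)+1);
    [~, locs] = findpeaks(env);             % local maxima of the contour
    formants  = (locs(1:2) - 1) * df;       % first two formant frequencies in Hz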
2.3. Optimisation of the MLP

I chose a Multilayer Perceptron network and used the Back Propagation Algorithm of Professor Yu Hen Hu to solve this classification problem. The network has 98 inputs, resulting from 49 sections times 2 formants per sample, and 12 outputs corresponding to the number of different words. At first I designed a two-word recognition system to get an idea of the required size of the net; a net with 3 hidden layers of 12 hidden neurons each let the algorithm converge with a classification rate of 100%. I used these values for the 12-word classifier as well, and after many tries and tests I established that these dimensions give the best classification results.

Then I tried to optimise parameter after parameter. I began with the number of samples per epoch and fixed it to 8. Afterwards I set the number of epochs to 200000 and the number of epochs before the convergence test to 150. After that, I experimented with the learning rate and the momentum and realized that it is much better to shrink both, so I finally used 0.01 for the learning rate and 0 for the momentum. This seems to be similar to choosing a smaller step size for the LMS algorithm to shrink the misadjustment - the Error Correcting Learning is in fact a kind of LMS algorithm. Subsequently, I tried to find a better solution by changing the scaling range of my input data and established that a range of [-1, 1] is much more useful than [-5, 5].

With these changes the classification rate reached 87% for training, but only 65% for testing. That was large progress, but a testing rate of 65% is not useful for a speech recognition system, so I had to improve upon it. The gap between training and testing seemed to be caused by the similarity of the words "zero" and "seven", between which the network had problems deciding; the word "four" also caused small problems. I tried a tuning set of 10% or 30% of the size of the input data, but that did not bring any improvement. Therefore I designed a neighbour network - an expert network - which is responsible for the distinction between the words "zero", "four", and "seven" whenever the main network produces its maximum output value for one of these words. After I had connected both networks together, this caused a large increase of the classification rate for testing.

2.4. Program Files

My programs can be tested by setting the Matlab path to the respective folders and starting the demo files. (Any overwritten necessary mat-files can be restored from the folder "Original_data".)

- Folder 1_Data_Collection
  o Shows the data collection when running "data_collection_demo.m"
  o Folder Speech_Samples (contains the text files of the samples)
  o Files: data_collection_demo.m, getdata.m, maximum.m, scale_data.m, spectrum_complete.m, textfiles.m, window_functions.m
- Folder 2_Training
  o Folder Main_network: shows the training process of the main network when running "training_main_demo.m"
    Files: actfun.m, actfunp.m, bp_all.m, bpconfig_all.m, bpdisplay.m, bptest.m, cvgtest.m, rand_files.m, randomize.m, rsample.m, scale.m, training_main_demo.m, input_scale.mat
  o Folder Expert_network: shows the training process of the expert network when running "training_expert_demo.m"
    Files: actfun.m, actfunp.m, bp_ex.m, bpconfig_ex.m, bpdisplay.m, bptest.m, cvgtest.m, data_choice.m, rand_files.m, randomize.m, rsample.m, scale.m, training_expert_demo.m, input_scale.mat
- Folder 3_Recognition
  o Shows the recognition of a word when running "recognition_demo.m"
  o Files: actfun.m, getdata.m, maximum.m, recognition_demo.m, scale_data_2.m, spectrum_single.m, window_functions.m, weights_all.mat, weights_ex.mat
  o final_test_classification_rate.txt (shows the recognition results of all 600 used samples)
  o test.txt (appropriate sample "eight")

3. Final Results

3.1. Final MLP

[Figure: final MLP architecture - the main network (98 inputs, three hidden layers of 12 neurons each, 12 outputs) and the expert network (3 outputs) both feed a logical unit which combines their outputs into the final output.]

3.2. Parameters and Results

3.2.1. Main network (distinguishes between all words)
- inputs: 98
- outputs: 12
- hidden layers: 3
- hidden neurons per hidden layer: 12
- learning rate: 0.01
- momentum: 0
- samples per epoch: 8
- number of epochs before convergence test: 150
- number of epochs until best solution: 200000
- classification rate after training: 87.6%
- classification rate after testing: 71.7%

3.2.2. Expert network (distinguishes between the words 0, 4 and 7)
- inputs: 98
- outputs: 3
- hidden layers: 3
- hidden neurons per hidden layer: 12
- learning rate: 0.01
- momentum: 0
- samples per epoch: 8
- number of epochs before convergence test: 150
- number of epochs until convergence: 38000
- classification rate after training: 100.0%
- classification rate after testing: 86.7%

3.2.3. Combined network

The function "recognition_network" uses the best weights calculated during training. The logical unit combines the outputs of both networks: if the output of the main network is maximum for 0, 4 or 7, the expert network is asked. This increases the total classification rate of the main network by about 20%, so the final classification rate after testing equals 90.83%. A sketch of such a logical unit follows below.
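The following minimal Matlab sketch shows how such a logical unit could combine the two networks. It is my own reconstruction, not the original "recognition_network"; in particular, the placeholder output vectors and the assumption that the words 0, 4 and 7 correspond to the class indices 1, 5 and 8 are hypothetical.

    % Minimal sketch of the logical unit (reconstruction, not the original
    % "recognition_network"). y_main is the 12-element output vector of the
    % main network, y_expert the 3-element output vector of the expert network.
    y_main   = rand(12, 1);       % placeholder network outputs
    y_expert = rand(3, 1);

    expert_classes = [1 5 8];     % assumed indices of "zero", "four", "seven"
    [~, winner] = max(y_main);    % decision of the main network

    if any(winner == expert_classes)
        [~, k] = max(y_expert);   % let the expert network decide instead
        winner = expert_classes(k);
    end
    % "winner" is the class index of the recognized word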
The final function "recognition_demo" can be applied to any appropriate text file; the name of the text file has to be changed inside the function.

4. Discussion

Using the main network alone, I only reached a classification rate of about 70%, although I tried to optimise all parameters, the scaling, and so on; that is not enough for a speech recognition system. I think this is caused by the input data: some vectors are too similar to each other, so the classes overlap. Therefore I decided to combine the main network with an expert network, which increased the classification rate to 90.83%.

The similarities between the vectors appear because I do not use any time-adjusting algorithm such as a non-linear mapping of all samples to a standard length; I did not want the computational expenditure to become too large. Since these are only short words - not longer than half a second - the shifting between the tones could be kept rather small.

I trained the whole network only on my own speech samples, so the results for other voices can differ considerably if the person speaks too fast or too slowly. The worst results are achieved when a female voice is tested, because its frequencies lie in a higher range than those of a male voice. I tested my algorithm on several persons, and the results vary quite a lot.

The results could of course be improved further by implementing another expert network (e.g. for the words 1, 9 and "no"), but instead of the 20% improvement achieved with the first expert network, it would only add about 1.5%, which is not very effective compared to the increase in network size; whether it is worthwhile depends on the size of the network that may be used. In addition, I realized that the recognition of a single word often takes too long, probably because I used many loops in my program; this is a point that was difficult for me to improve.

Altogether I am quite satisfied with these results, even though single aspects might still be improved.

Appendix: References
- "Signal Analysis and Recognition", Rüdiger Hoffmann
- "Technical Speech Communication" (lecture notes), Dresden University of Technology, Institute for Acoustics and Speech Communication