ECE 539 Final Project
Speech Recognition
Christian Schulze
902 227 6316
freeslimer@gmx.de
1.Introduction
1.1.Problem Statement
Design of a speech recognition Artificial Neural Network that is able to distinguish
between the digits 0 to 9 and the words "yes" and "no".
1.2.Motivation
1.2.1.Applications
- This system can be very useful for cellular phones, so that the user only has to say the digits to dial the desired telephone number.
- In addition, it can be used in elevators to get to the desired floor without having to push any buttons.
- It can be used almost anywhere that a choice requires a number selection.
1.2.2.Comparison to large systems
Conventional speech recognition systems contain huge libraries of stored patterns for all the
words to be recognized. This of course requires a large memory. In addition, these systems
need a fast processor for the non-linear time adjustment of the target patterns to the respective
words to be recognized and for their comparison to the patterns. After the comparison an error
is calculated which represents the distance between a pattern and a word; the pattern with the
smallest error is the solution of the recognition.
Such resources are too expensive for small applications like the ones above.
That is why I tried to train a neural network that can do the same.
1.3.Overview of the Artificial Neural Network
I designed a Multilayer Perceptron (MLP) network using the Back Propagation Algorithm of
Professor Hu to solve this classification problem. Since I want to distinguish 12 different
words, this is a classification problem with 12 classes, which fixes the number of network
outputs at 12.
The input data are extracted feature values that represent the signal. For each part of a signal
I always store the frequencies of the first two formants, corresponding to the first two maxima
of the spectrum. That is why the process is also called a "Two-Formants-Recognition-System".
I implemented the whole algorithm in Matlab.
2.Work Performed
2.1.Listing of Steps
2.1.1.Raw data collection
- Recording of 50 samples for every word using "Cool Edit"
  o Sampling frequency: 44100 Hz
  o Channels: mono
  o Resolution: 32 bit (quantization of the amplitude values)
- Storage of each sample in its own text file (stored amplitude values for all time values)
- Summary of all text file names in a vector "text_files", saved in "textf.mat" using the function "textfiles.m" (see the sketch after this list)
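As a hedged illustration of this step, the following Matlab sketch collects file names into the "text_files" vector and loads one sample. The naming scheme and the one-value-per-line file format are assumptions, since the exact Cool Edit export format is not fixed by the report.

% Sketch: collect the sample file names and load one sample (cf. textfiles.m, getdata.m).
% The naming scheme "<word>_<nn>.txt" is a hypothetical assumption.
words = {'zero','one','two','three','four','five', ...
         'six','seven','eight','nine','yes','no'};
text_files = {};
for w = 1:length(words)
    for s = 1:50                                   % 50 samples per word
        text_files{end+1} = sprintf('%s_%02d.txt', words{w}, s);
    end
end
save('textf.mat', 'text_files');

% Load the amplitude values of one sample, assuming one value per line:
x  = load(text_files{1}, '-ascii');                % column vector of amplitudes
fs = 44100;                                        % sampling frequency in Hz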
2.1.2.Extraction of input data
- Passing the text file vector to the function "spectrum_complete.m" (a sketch of this step follows the list)
  o Adjustment of all signals to a uniform signal length of 500 ms
  o If the real signal length is smaller than 500 ms, the missing values are filled with zeros
    (no time stretching or shrinking, so the feature data are not changed)
- Windowing of the signal every 10 ms into sections of 20 ms length using the Hann window (multiplication of the signal with the window)
  => 49 separate sections for each sample
- Calculation of the spectrum of these sections using the Discrete Fourier Transformation (the spectrum of the sampled signal is unique up to the Nyquist frequency of 22050 Hz, but only the first 4 kHz are of interest)
- Implementation of the Cepstral Algorithm for the calculation of the spectrum's contour (the spectrum becomes smoothed)
  o Calculation of the Cepstral Coefficients by applying the Inverse Discrete Fourier Transformation after taking the natural logarithm of the spectral coefficients => change to the "quefrency" domain (a kind of time domain)
  o The front values contain the information about the high frequencies, the back values the information about the low frequencies
  o Short-pass liftering of the Cepstral Coefficients (taking only the first 147 Cepstral Coefficients)
  o Calculation of the smoothed spectrum by applying the Discrete Fourier Transformation to the liftered Cepstral Coefficients
- Storage of the frequencies of the first two formants in a vector (for the example sample these two values are 441 Hz and 804 Hz)
  o The vector for every sample becomes 98 by 1 (49 sections x 2 formants)
  o This vector is concatenated with the desired values of the respective class and stored in a matrix
  o This matrix consists of 50 of these vectors for every word, so its dimension is 600 by 110 (98 features plus 12 target values per row)
  o Storage of this "input_data" in the file "input.mat"
2.1.3.Input data for the MLP-Network
- Scaling of the input data to [-1, 1] using "scale_data.m" (sketched below)
- Random division of the "input_scaled.mat" data into the files "training.mat" and "testing.mat" in a ratio of 9 to 1 using the file "rand_files.m" (integrated in the Back Propagation Algorithm files "bp_all.m" and "bp_ex.m")
- Feeding these input data into the MLP network
- Sizing and optimisation of the network
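A minimal Matlab sketch of the scaling and the 9-to-1 split might look as follows; whether "scale_data.m" scales globally or per feature is an assumption here.

% Sketch of the scaling and the random split (cf. scale_data.m, rand_files.m).
% input_data: 600-by-110 matrix, 98 features followed by 12 target values per row.
X  = input_data(:, 1:98);
mn = min(X(:));
mx = max(X(:));
Xs = 2*(X - mn) / (mx - mn) - 1;     % map the feature range onto [-1, 1]
input_scaled = [Xs, input_data(:, 99:110)];

order = randperm(600);               % random order of all samples
ntrn  = round(0.9 * 600);            % 9-to-1 ratio -> 540 training samples
training = input_scaled(order(1:ntrn), :);
testing  = input_scaled(order(ntrn+1:end), :);
save('training.mat', 'training');
save('testing.mat', 'testing');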
2.2. Used algorithms and formulas
2.2.1. Discrete Fourier Transformation (DFT)
Reference: Rüdiger Hoffmann, "Signal Analysis and Recognition"
Actually I used the Short Time Discrete Fourier Transformation (STDFT), which corresponds to
the convolution of the spectra of the signal and the window function:

X(n, m) = \frac{1}{N} \sum_{k=m-N+1}^{m} x(k)\, h(m-k)\, \exp\!\left(-j \frac{2\pi}{N}\, n\, (k - m + N - 1)\right), \qquad n = 0, \ldots, N-1

where h(k) represents the k-th value of the window function.
The same result is obtained by multiplying the signal with the window function and applying
the ordinary DFT (computed with the FFT) to the product:

X(n) = \frac{1}{N} \sum_{k=0}^{N-1} x(k)\, \exp\!\left(-j 2\pi \frac{nk}{N}\right)
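This equivalence can be checked numerically with a few lines of Matlab; the frame length of 882 samples corresponds to the 20 ms windows used above.

% Windowing in the time domain followed by an ordinary DFT (via the FFT).
N    = 882;                          % 20 ms at fs = 44100 Hz
fs   = 44100;
x    = randn(N, 1);                  % arbitrary test frame
h    = hann(N);                      % window function
X    = fft(x .* h) / N;              % multiply by the window, then the DFT
nbin = floor(4000/fs * N) + 1;       % bins up to 4 kHz (resolution fs/N = 50 Hz)
X4k  = X(1:nbin);                    % only this part is used later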
2.2.2. Cepstral Algorithm
Reference: Rüdiger Hoffmann, "Signal Analysis and Recognition"
Human speech consists of two different parts. First, the vocal cords produce an origin signal
with a characteristic frequency, about 150 Hz for men and about 300 Hz for women. After the
production of the origin signal, the air is modulated by the vocal tract, which consists of the
pharyngeal, oral and nasal cavities. This modulation causes the high frequencies in the voice;
the maxima of these frequencies represent the different tones.
The Cepstral Algorithm is used to separate these two signal parts.
The modulation of the signal corresponds to a convolution of the signal with the system
function of the vocal tract, which equals a multiplication of both spectra:

\mathcal{F}\{x * g\} = \mathcal{F}\{x\} \cdot \mathcal{F}\{g\} = X(\omega) \cdot G(\omega) = Y(\omega)

where * is the convolution operator and \mathcal{F} represents the Fourier Transformation.
Applying the Inverse Fourier Transformation after taking the logarithm of that expression
separates the two parts of the speech, because they concentrate on different ranges in the time
domain:

\mathcal{F}^{-1}\{\log Y(\omega)\} = \mathcal{F}^{-1}\{\log [X(\omega) \cdot G(\omega)]\} = \mathcal{F}^{-1}\{\log X(\omega) + \log G(\omega)\}
= \mathcal{F}^{-1}\{\log X(\omega)\} + \mathcal{F}^{-1}\{\log G(\omega)\}

Since the recorded signal is real, the "Real Cepstrum" can be used, which is defined as

c(\tau) = \mathcal{F}^{-1}\{\ln |X(\omega)|^{2}\}

if x is an energy signal.
The calculation of the Cepstral Coefficients causes a change into the "quefrency" domain, a
kind of logarithmic time domain. The information about the articulation is encoded in the
smaller coefficients. That is why the coefficients are passed through a short-pass "lifterer"
with the border coefficient "c_border". The coefficient corresponding to a given frequency can
be calculated as

c \approx f_s / f

where f_s is the sampling frequency. With a border frequency of 300 Hz this gives

c_border \approx 44100 Hz / 300 Hz = 147

So the first 147 coefficients are taken and the DFT is applied to them. After that the logarithm
is cancelled by using the exponential function, and taking the square root of the expression
yields the spectral coefficients of the smoothed spectrum (the contour).
With this contour it is possible to determine the maxima - the formants - of which the first two
are the representative features I used. A sketch of the whole smoothing step follows.
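A minimal Matlab sketch of this smoothing step, written as the hypothetical helper "smoothed_spectrum" referenced in Section 2.1.2, could look as follows. It assumes at least two spectral maxima are found and uses findpeaks from the Signal Processing Toolbox instead of my own maximum search in "maximum.m".

function [f1, f2] = smoothed_spectrum(frame, fs)
% Hypothetical helper: cepstral smoothing of one windowed frame and
% extraction of the first two formant frequencies, as described above.
N = length(frame);
X = fft(frame);
c = real(ifft(log(abs(X).^2 + eps)));   % real cepstrum c = F^-1{ln |X|^2}
c_border = round(fs/300);               % short-pass liftering border (147)
c(c_border+1 : N-c_border+1) = 0;       % keep only the low quefrencies
S = sqrt(exp(real(fft(c))));            % cancel ln and ^2 -> smoothed |X|
S = S(1:floor(4000/fs*N)+1);            % only the first 4 kHz are of interest
[~, locs] = findpeaks(S);               % maxima of the contour = formants
f  = (locs - 1) * fs / N;               % bin indices -> frequencies in Hz
f1 = f(1);                              % first formant
f2 = f(2);                              % second formant (assumes >= 2 peaks)
end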
2.3. Optimisation of the MLP
I chose a Multilayer Perceptron network and used the Back Propagation Algorithm of
Professor Yu Hen Hu to solve this classification problem.
The network has 98 inputs, resulting from 49 sections times 2 formants for each sample, and
12 outputs corresponding to the number of different words.
At first I tried to design a two-word recognition system to get an idea of the size of the net.
This resulted in a net with 3 hidden layers of 12 hidden neurons each that let the algorithm
converge with a classification rate of 100%.
I used these values for the 12-word classifier as well, and after many tries and tests I
established that these dimensions give the best classification results.
Then I tried to optimise one parameter after another to get a better result. I began with the
number of samples per epoch and fixed it at 8. Afterwards I set the number of epochs and the
number of epochs before the convergence test to 200000 and 150, respectively.
After that I experimented with the learning rate and the momentum and realized that it is
much better to shrink both, so I finally used 0.01 for the learning rate and 0 for the
momentum. This seems to be similar to choosing a smaller step size for the LMS algorithm to
shrink the misadjustment - Error Correcting Learning is in fact a kind of LMS algorithm.
Subsequently I tried to find a better solution by changing the scaling range of my input data
and established that a range of [-1, 1] is much more useful than [-5, 5].
With these changes the classification rate reached 87% for training but only 65% for testing.
That was significant progress, but a testing rate of 65% is not useful for a speech recognition
system, so I had to improve it further.
The gap between training and testing seemed to be caused by the similarity of the words zero
and seven, which the network had problems distinguishing; the word four also caused minor
problems.
I tried to use a tuning set of 10% or 30% of the input data, but that did not bring any
improvement.
That is why I designed a neighbour network - an expert network - which is responsible for the
distinction between the words zero, four and seven whenever the main network gives its
maximum output for one of these words. After connecting both networks, this caused a large
increase of the classification rate for testing. The final parameter choices are summarised in
the sketch below.
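For reference, the final parameter choices can be written down as a configuration in the style of the bpconfig files; the variable names below are illustrative assumptions, not the actual names used in "bpconfig_all.m".

% Final hyperparameters of the main network (illustrative names only):
n_input    = 98;         % 49 sections x 2 formants
n_output   = 12;         % one output per word
hidden     = [12 12 12]; % three hidden layers of 12 neurons each
alpha      = 0.01;       % learning rate
momentum   = 0;          % momentum constant
per_epoch  = 8;          % samples per epoch
max_epochs = 200000;     % epochs until the best solution
cvg_test   = 150;        % epochs between convergence tests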
2.4.Program files
My programs can be tested by setting the Matlab path to the respective folders and starting
the demo files. (Any necessary mat-files that get overwritten can be restored from the folder
"Original_data".)
- Folder 1_Data_Collection
  o Shows the data collection when running "data_collection_demo.m"
    - Folder Speech_Samples (contains the text files of the samples)
    - data_collection_demo.m
    - getdata.m
    - maximum.m
    - scale_data.m
    - spectrum_complete.m
    - textfiles.m
    - window_functions.m
- Folder 2_Training
  o Folder Main_network
    - Shows the training process of the main network when running "training_main_demo.m"
    - actfun.m
    - actfunp.m
    - bp_all.m
    - bpconfig_all.m
    - bpdisplay.m
    - bptest.m
    - cvgtest.m
    - rand_files.m
    - randomize.m
    - rsample.m
    - scale.m
    - training_main_demo.m
    - input_scale.mat
  o Folder Expert_network
    - Shows the training process of the expert network when running "training_expert_demo.m"
    - actfun.m
    - actfunp.m
    - bp_ex.m
    - bpconfig_ex.m
    - bpdisplay.m
    - bptest.m
    - cvgtest.m
    - data_choice.m
    - rand_files.m
    - randomize.m
    - rsample.m
    - scale.m
    - training_expert_demo.m
    - input_scale.mat
- Folder 3_Recognition
  o Shows the recognition of a word when running "recognition_demo.m"
    - actfun.m
    - getdata.m
    - maximum.m
    - recognition_demo.m
    - scale_data_2.m
    - spectrum_single.m
    - window_functions.m
    - weights_all.mat
    - weights_ex.mat
    - final_test_classification_rate.txt (shows the recognition results of all 600 used samples)
    - test.txt (an appropriate sample, "eight")
3. Final Results
3.1.Final MLP
[Figure: architecture of the final MLP. The 98 inputs feed both the main network (three
hidden layers of 12 neurons each, 12 outputs) and the expert network (three hidden layers of
12 neurons each, 3 outputs); a logical unit combines the outputs of both networks into the
final output.]
3.2.Parameter and Results
3.2.1.Main network
(distinguishes between all words)
- inputs: 98
- outputs: 12
- hidden layers: 3
- hidden neurons per hidden layer: 12
- learning rate: 0.01
- momentum: 0
- samples per epoch: 8
- number of epochs before convergence test: 150
- number of epochs until best solution: 200000
- classification rate after training: 87.6%
- classification rate after testing: 71.7%
3.2.2.Expert network
(distinguishes between the words 0, 4 and 7)
- inputs: 98
- outputs: 3
- hidden layers: 3
- hidden neurons per hidden layer: 12
- learning rate: 0.01
- momentum: 0
- samples per epoch: 8
- number of epochs before convergence test: 150
- number of epochs until convergence: 38000
- classification rate after training: 100.0%
- classification rate after testing: 86.7%
3.2.3.Combined network
The function "recognition_network" uses the best weights calculated during training.
The logical unit combines the outputs of both networks: if the output of the main network is
maximum for 0, 4 or 7, the expert network is asked. This increases the total classification
rate of the main network by about 20%, so the final classification rate after testing equals
90.83%. A sketch of this logic follows.
You can apply the final function "recognition_demo" to any appropriate text file; the name of
the text file has to be changed in the function.
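A sketch of this gating logic, with assumed class indices (1 to 12 for the words 0 to 9, "yes" and "no") and a hypothetical forward-pass helper, might look like this:

% Sketch of the logical unit in "recognition_network". The class order and
% the helper expert_forward are assumptions, not the actual code.
[~, k] = max(main_out);              % decision of the main network (12-by-1)
if any(k == [1 5 8])                 % maximum output for 0, 4 or 7?
    ex_out = expert_forward(features);   % hypothetical expert-net forward pass
    sub    = [1 5 8];                % the three classes the expert knows
    [~, j] = max(ex_out);
    k = sub(j);                      % the expert network overrides the decision
end
word = k;                            % final recognized class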
4.Discussion
Using the main network alone I only got a testing classification rate of about 70%, although I
tried to optimise all parameters, the scaling and so on; that is not enough for a speech
recognition system. I think this is caused by the input data: some feature vectors are too
similar to each other, causing an overlapping of classes. So I decided to combine the main
network with an expert network, which increased the classification rate to 90.83%.
The similarities between the vectors appear because I do not use any time-adjusting algorithm
such as a non-linear mapping of all samples to a standard length; I did not want the
calculation expenditure to become too large. Since these are only short words - not longer
than half a second - the shifting between the tones could be kept rather small.
I trained the whole network using only my own speech samples, so the results for other voices
can differ considerably if the person speaks too fast or too slowly. The worst results are
achieved with a female voice, because its frequencies lie in a higher range than those of a
male voice. I tested my algorithm on several persons and the results vary quite a lot.
The results could of course be improved by implementing another expert network (e.g. for
the words 1, 9 and no), but instead of the 20% improvement of the classification rate achieved
with the first expert network, it would yield only about a 1.5% improvement, which would
not be very effective compared to the increase in network size. But that depends on the size
of the network that may be used.
In addition I realized that the recognition of a single word often takes too long. That is
probably because I used many loops in my program to implement the algorithm, but this
point was difficult for me to improve.
Altogether I am quite satisfied with these results, even though individual aspects might still
be improved.
Appendix:
References:
Rüdiger Hoffmann, "Signal Analysis and Recognition".
"Technical Speech Communication", lecture notes, Dresden University of Technology
(Institute for Acoustics and Speech Communication).