VOICE RECOGNITION SYSTEM IN NOISY ENVIRONMENT
Vinit D Patel
B.S., Parul Institute of Engineering and Technology, India, 2008
PROJECT
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF SCIENCE
in
ELECTRICAL AND ELECTRONIC ENGINEERING
at
CALIFORNIA STATE UNIVERSITY, SACRAMENTO
SPRING
2011
VOICE RECOGNITION SYSTEM IN NOISY ENVIRONMENT
A Project
by
Vinit D Patel
Approved by:
__________________________________, Committee Chair
Jing Pang, Ph.D.
__________________________________, Second Reader
Manish Gajjar, M.S., MBA
____________________________
Date
Name of Student: Vinit D Patel
I certify that this student has met the requirements for format contained in the University
format manual, and that this project is suitable for shelving in the Library and credit is to
be awarded for the Project.
__________________________, Graduate Coordinator
Preetham Kumar, Ph.D.
Department of Electrical and Electronic Engineering
________________
Date
Abstract
of
VOICE RECOGNITION SYSTEM IN NOISY ENVIRONMENT
by
Vinit D Patel
Human life could be much more comfortable if machinery everywhere worked on voice
commands. There are many products available on the market that provide an automated,
voice-controlled experience. Voice recognition is a technology in which a machine
responds to the human voice. The goal of this project is to design and implement a
voice-controlled embedded system for a noisy environment. The project consists of
software and hardware design. The software part eliminates noise from the original
signal and implements the speech recognition algorithm, Dynamic Time Warping (DTW).
The hardware is used for speech recognition.
In this project, the DTW algorithm was developed to study and research the
implementation of single-word speech recognition.
In addition, I implemented a nearest neighbor algorithm following DTW, which helps
match the speech across different people's accents. For better results, I used a
Wiener filter to increase the signal-to-noise ratio of the applied signal. The
software portion is fully developed in the MATLAB environment.
On the hardware side, the work concentrated on comparing and processing the applied
speech against pre-stored speech signals using the HM2007 chip. After processing,
the output of the hardware controls the system according to the speech.
_______________________, Committee Chair
Jing Pang, Ph.D.
_______________________
Date
DEDICATION
I would like to dedicate this report to my mom. I have only been able to
accomplish what I have thanks to her inspiration and sacrifice. She always said, “Work
harder, God will be there to help you.” She lived that out every day of her life.
Indira (aka Bharti) Patel (1963-2009)
ACKNOWLEDGMENT
This space provides me with a great opportunity to thank all the people without
whom this project would never have been possible. I take this opportunity to convey
my sincere thanks to all of them.
First, I would like to thank Dr. Jing Pang, Associate Professor in the Electrical
and Electronic Engineering Department at California State University, Sacramento,
for being my project guide, guiding me when needed, and always encouraging me to
implement new things. She took time out of her busy schedule to provide me with
valuable suggestions regarding the project as well as the project report.
I would like to thank my Dad, whose constant tinkering, obsessive organization,
and ingenuity gave me the mind and attitude of a true engineer.
Special thanks to Mr. Manish Gajjar, Instructor in the Electrical and Electronic
Engineering Department at California State University, Sacramento, for being my
second project guide. I would also like to thank Dr. Preetham Kumar, Graduate
Coordinator of the Electrical and Electronic Engineering Department at California
State University, Sacramento, for providing me this wonderful opportunity of gaining
the best knowledge. I also want to thank all my friends, without whom I would never
have been able to complete my project on time. They were always there to help and
guide me whenever I had any doubts, and their constant support and encouragement
helped me to work even harder.
TABLE OF CONTENTS
Page
Dedication……….…...……………………………...…………………………………....vi
Acknowledgments…...……………………………...…………………………………...vii
List of Tables …………………………………………………………………………......x
List of Figures…………………………………………………………………………….xi
Chapter
1. INTRODUCTION ……………………………………………………………...……..1
1.1 Overview……….…..…………………..……………………………….……….....1
1.2 Purpose of Project...……………..……………………………………….……...…2
1.3 Applications of Voice Recognition Embedded System………………….……...…4
2. SPEECH RECOGNITION………………..……………………………….……….......5
2.1 Classification of Speech Recognition System..………....……………….................5
2.2 Basic Composition of Speech Recognition System…...…….....…………………...6
2.3 Software Results………………………………….…...…….....…………………...8
3. WIENER FILTER………...….……………………………………………....………...9
3.1 Introduction……………….. …………..…………………………………………...9
3.2 Block Diagram and Mathematical Solution……..………………………………...10
3.3 Wiener Filters Input-Output Relationships…..…..………………………………..12
3.4 The Wiener-Hopf Technique……………..……………………………….............15
3.5 Wiener Filter Solution…………..………………………………………………...16
3.6 Simulation Result………………..………………………………………………...18
4. DYNAMIC TIME WARPING ……………………………………………………….19
4.1 Overview………………..…………………………..………………….………....19
4.2 DTW Algorithm…………….....………………………………………………….20
4.3 Mel-Frequency Cepstrum Coefficients...……..…………………………………..22
4.4 Applications……………..…...…………………………………………...............23
4.5 Distance Matrix Coefficient Result….…………………………………...............24
5. K-NEAREST NEIGHBOR ALGORITHM…..……………………………………....25
5.1 Introduction………… ……..………………...……………….…………………..25
5.2 Assumptions in KNN………………...………..……………………….………… 27
5.3 KNN for Density Estimation…..…...………...……………….…………………..28
5.4 KNN Classification…………………………...……………….…………………..29
5.5 Implementation of KNN………..…………..……………………….………….…32
5.6 Results………………………..…………..……………………….………….……32
6. HARDWARE IMPLEMENTATION...……………………………………………….33
6.1 Overview………………..……………………………………………………….. 33
6.2 Block Diagram of Hardware……...…...………….…………...……………...…..36
6.3 Training and Working of HM-2007……...……….…………...……………...…..36
6.4 Microcontroller Interfacing with Motors...…..…………………………………...38
6.5 Limitation of Hardware…………...….………………...……...……………...…..41
7. CONCLUSION………………………………….….……………….………………...42
Appendix A Software Code.……………………………………….…………………….42
Appendix B Hardware Code.……………………………………….……………………53
Bibliography ..………………………………………………………….………………..55
LIST OF TABLES
Page
Table 1 List of 13 Speech Enhancement Algorithms Evaluated ..……….………..…….. 2
Table 2 Component List ………………………….……………..……….………..……..33
LIST OF FIGURES
Page
Figure 1 Speech Recognition Flowchart …………...……………………………………. 7
Figure 2 Basic Diagram and Design of Project …...……….……………………………10
Figure 3 Wiener Filter Block Diagram ……………….…………………………………11
Figure 4 W Matrix Calculation for Single Microphone Case ….…………….………….13
Figure 5 Simulation Result of Wiener Filter……………………………………………..18
Figure 6 DTW Algorithm to Search Path………………....……………………………. 21
Figure 7 Euclidean Distances between Two Vectors Xr and Xs ………………………. 27
Figure 8 Circuit Schematic of SR-07…………………………...…...…………………...34
Figure 9 Block Diagram of Hardware…………………………...…...………………….36
Figure 10 Simple Darlington Transistor………………..……….……………………….39
Figure 11 Microcontroller Interfacing Circuit……………..……..….…………….…… 39
Figure 12 Pictures of Whole Project …………………..…………..….……………....... 40
Chapter 1
INTRODUCTION
The theme of social interaction and intelligence is important and interesting to the
Artificial Intelligence and Robotics community [1]. It is one of the challenging
areas in Human-Robot Interaction. Speech recognition technology is a great aid in
meeting this challenge, and it is a prominent technology for the Human-Computer
Interaction of the future. Humans have five classical senses: vision, touch, smell,
taste and hearing, by which they perceive the surrounding world. The main goal of
this project is to apply the hearing sense and speech analysis to an embedded system.
1.1 Overview
A voice recognition embedded system is an advanced control system that uses
human voice/audio speech to identify a speech command. The system then performs the
action corresponding to the given command. Today, many speech recognition systems
and networks are available on the market, and many companies have built speech
recognition microprocessors for use in such systems and networks. These kinds of
integrated circuits use a pre-recorded speech signal as the reference for
recognizing the current speech command. Such systems perform differently in
different environments, noisy and non-noisy. For an efficient design, we have to
build the system for the worst condition, which is the noisy environment. For such
an efficient system, we can remove the noise from the speech command and then apply
it to the speech recognition system.
1.2 Purpose of Project
In the software part of this project, I developed a nearest neighbor algorithm for
speech recognition that follows a Dynamic Time Warping (DTW) calculation. The
project worked perfectly fine in a non-noisy environment but had issues in a noisy
environment, so I researched noise reduction techniques for speech signals and found
a good survey of various speech enhancement algorithms [2].
Algorithm      Equation/Parameters                Ref
KLT            Eq. 14, 48                         [8]
pKLT           Eq. 34, v = 0.08                   [9]
MMSE-SPU       Eq. 7, 51, q = 0.3                 [10]
logMMSE        Eq. 20                             [11]
logMMSE-ne     Eq. 20                             [11]
logMMSE-SPU    Eq. 2, 8, 10, 16                   [12]
pMMSE          Eq. 12                             [13]
RDC            Eq. 6, 7, 10, 14, 15               [14]
RDC-ne         Eq. 6, 7, 10, 14, 15               [14]
MB             Eq. 4-7                            [15]
WavThr         Eq. 11, 25                         [16]
Wiener_as      Eq. 3-7                            [4]
AudSup         Eq. 26, 38, v(i)=1, 2 iterations   [17]
Table 1: List of 13 Speech Enhancement Algorithms Evaluated [2]
After studying all of them, I decided to work with the Wiener filter, which is easy
to implement and, on average, effective in all kinds of noise such as car noise,
street noise, and babble noise. The filtered input goes to the DTW calculation,
which helps find the nearest neighbor of the spoken speech. I implemented the whole
algorithm in MATLAB.
In the hardware part, I worked with one of the speech recognition processors, the
HM2007, to recognize the speech command. I applied the speech signal through a
microphone to the HM2007, which compared the command with pre-recorded speech and
gave the output accordingly. The AT89C52 microcontroller processed the output signal
of the HM2007 and controlled the motors of the embedded system.
1.3 Applications of Voice Recognition Embedded System
The following are applications of the system under discussion:
1. Development of educational games and smart toys
2. Key-free operation of devices such as personal computers, laptops, automobiles,
   cell phones, door locks, smart card applications, ATM machines, etc.
3. Support for disabled people
4. Alert/warning signals during emergencies in airplanes, trains, and/or buses
5. Automatic payment and customer service support through telephones
Chapter 2
SPEECH RECOGNITION
Speech recognition is important for a machine to understand human voices and perform
actions according to human commands. Speech recognition is a highly active research
area, useful in pattern recognition and involving physiology, psychology,
linguistics, computer science, signal processing, and many other fields, even
extending to body language (such as the expressions, gestures, and other actions
people use while speaking to help each other understand).
2.1 Classification of Speech Recognition System
Speech recognition systems, viewed from different points of view and different
application scopes, have different performance requirements of the design. Their
implementations are of the following types:
1) Isolated-word, connected-word, continuous speech recognition, and conversational
   speech understanding systems
2) Large-vocabulary and small-vocabulary systems
3) Speaker-specific and non-specific (speaker-independent) speech recognition systems
2.2 Basic Composition of Speech Recognition System
A typical speech recognition flow is shown in Figure 1. The input analog voice
signal first goes through preprocessing, which includes pre-filtering, sampling,
quantization, windowing, endpoint detection, pre-emphasis and so on. The next
important part is feature extraction. The requirements on the characteristic
parameters are:
1) Extract characteristic parameters that are representative of the voice
2) Parameters of different orders should be mutually independent
3) The characteristic parameters should be easy to calculate, choosing a highly
   efficient method to ensure real-time implementation
[Flowchart: Start → Enter Speech → Wiener Filter (eliminate noise) → mel-frequency
cepstrum coefficient calculation → compare the entered speech with the stored
reference speech signals (read from MAT files created from pre-recorded speech);
if no match, continue; if yes, display the result → End.]
Figure 1: Speech Recognition Flowchart
In this project, speech recognition software was developed using the Wiener filter,
DTW, and KNN algorithms. The Wiener filter took its input signal from the
microphone; it helped remove noise from the original signal and increased the
signal-to-noise ratio. The filtered signal was then applied to MFCC processing to
create mel-frequency cepstrum coefficients.
First, reference files were created for the various pre-recorded speech signals.
When the microphone input signal was applied, its MFC coefficients were compared to
the pre-recorded speech's MFC coefficients using the DTW algorithm. The output
scores of DTW were applied to the KNN algorithm to find the nearest common sound
among the five different recorded speech signals. Finally, the software output was
displayed on the MATLAB output screen. The software displays the correct speech
command when the applied microphone signal matches one of the pre-recorded signals.
2.3 Software Results
The following cases are MATLAB test results of the software:
Case 1: Press any key to start 2 seconds of speech recording...Recording
speech...Finished recording. System is trying to recognize what you have spoken...
No microphone connected or you have not said anything.
Case 2: Press any key to start 2 seconds of speech recording...Recording
speech...Finished recording. System is trying to recognize what you have spoken...
You just said Forward.
Chapter 3
WIENER FILTER
3.1 Introduction
A practical problem in signal processing is that it often fails to extract a signal
from noise. Therefore, we need a filter with so-called optimal linear filter
characteristics [7], so that when both signal and noise are applied to the filter,
the output reproduces the signal as accurately as possible with maximum noise
suppression. Wiener filtering is used to solve this class of problems of extracting
a signal from noise.
The Wiener filter was introduced by Norbert Wiener in 1949 [3]. It was also
introduced independently for the discrete-time case by Kolmogorov [4].
Wiener-Kolmogorov filters make the following assumptions: (a) signal and (additive)
noise are stochastic processes with known spectral characteristics or known
auto-correlation and cross-correlation, and (b) the performance criterion is minimum
mean-square error. An optimal filter can be found from a solution based on scalar or
multivariable methods [5]. The goal of the Wiener filter is to filter out, by
statistical means, noise that has corrupted the signal [6].
3.2 Block Diagram and Mathematical Solution
Consider a linear system whose unit sample response is h(n), driven by a random
input signal x(n):

x(n) = s(n) + v(n) …………………………………... (3.1)

where s(n) is the signal and v(n) is the noise in the signal. The output y(n) of the
linear system is

y(n) = Σ_m h(m) x(n − m) ………………………………. (3.2)

The project aim is to apply the original noisy signal x(n) to the linear system h(n)
so that the received output y(n) is as close as possible to the desired signal s(n).
The estimated value ŝ(n) is therefore

y(n) = ŝ(n) …………………………………..... (3.3)

So the basic diagram and design of the project is:

[x(n) = s(n) + v(n)] → [h(n)] → [y(n) = ŝ(n)]
Figure 2: Basic Diagram and Design of Project
A Wiener filter block diagram as described by Haykin [8] is shown in Figure 3.
[Figure: the desired signal d (M-dimensional vector) enters a summing junction with
a plus sign; the filter output z (M-dimensional vector), produced by applying the
M × M filter matrix W to the input signal y (M-dimensional vector), enters with a
minus sign; the difference is the error e (M-dimensional vector).]
Figure 3: Wiener Filter Block Diagram
From Figure 3, we can see that the microphone signal y is applied to the filter, and
the filter output z is nearly our desired signal. The filtered signal z still has
some residual error,

e = d − z = d − Wy ……………………..……….(3.4)

Figure 3 shows that when z = d, then e = 0. This means that when e = 0, z is the
estimated value of d. Therefore, when a desired signal arrives with noise, a
suitably selected W matrix can estimate the desired signal. Thus the goal of the
speech enhancement part of the project is to compute the W matrix [7].
3.3 Wiener Filter Input-Output Relationships
As shown in Figure 2, the linear system h(n) acts as an estimator of s(n). Let ŝ(n)
and s(n) represent the estimated value and the true value respectively, and let e(n)
be the error between them. The error e(n) may be positive or negative, and it is a
random variable; it is therefore reasonable to express the error by its mean square
value. The minimum mean square error criterion is:

E[e²(n)] = E[(s − ŝ)²] …………………………….. (3.5)

The output of the filter is known to be:

y(n) = ŝ(n) = Σ_{m=0}^{N−1} h(m) x(n − m) ………………………… (3.6)

The error is:

e(n) = s(n) − ŝ(n) = s(n) − Σ_{m=0}^{N−1} h(m) x(n − m) ………….…….. (3.7)

So the mean square error is:

E[e²(n)] = E[(s(n) − Σ_{m=0}^{N−1} h(m) x(n − m))²] ………………… (3.8)

Setting the derivatives with respect to the filter impulse response h(m),
m = 0, 1, …, N−1, to zero gives:

−2 E[(s(n) − Σ_{m=0}^{N−1} h_opt(m) x(n − m)) x(n − j)] = 0,  j = 0, 1, …, N−1 ……. (3.9)

Further:

E[s(n) x(n − j)] = Σ_{m=0}^{N−1} h_opt(m) E[x(n − m) x(n − j)],  j = 0, 1, …, N−1 ……. (3.10)
Now, for the W matrix calculation of Figure 3, we can take reference from the
following figure:

[Figure: the input signal y (M-dimensional vector) is applied to the M × M filter
matrix W to produce the filter output z (M-dimensional vector); the correlation
estimates Rxx and Rxs, obtained with the help of a voice activity detector (VAD),
are used to compute W.]
Figure 4: W Matrix Calculation for Single Microphone Case [7]

Thus:

Rxs(j) = Σ_{m=0}^{N−1} h_opt(m) Rxx(j − m),  j = 0, 1, 2, …, N−1 ……………… (3.11)
Therefore, we can get the linear equations:

j = 0:    Rxs(0)   = h(0)Rxx(0)   + h(1)Rxx(1)   + … + h(N−1)Rxx(N−1)
j = 1:    Rxs(1)   = h(0)Rxx(1)   + h(1)Rxx(0)   + … + h(N−1)Rxx(N−2)
  ⋮
j = N−1:  Rxs(N−1) = h(0)Rxx(N−1) + h(1)Rxx(N−2) + … + h(N−1)Rxx(0)
 ….. (3.12)

Written in matrix form:

| Rxx(0)     Rxx(1)     …  Rxx(N−1) | | h(0)   |   | Rxs(0)   |
| Rxx(1)     Rxx(0)     …  Rxx(N−2) | | h(1)   | = | Rxs(1)   |
|   ⋮           ⋮        ⋱     ⋮     | |   ⋮    |   |    ⋮     |
| Rxx(N−1)   Rxx(N−2)   …  Rxx(0)   | | h(N−1) |   | Rxs(N−1) |
 ………. (3.13)

The simplified form of the matrix equation is:

Rxx H = Rxs …………………………………….. (3.14)

where H = [h(0) h(1) … h(N−1)]' contains the filter coefficients,
Rxs = [Rxs(0), Rxs(1), …, Rxs(N−1)]' is the cross-correlation vector of the
generated signal sequence, and Rxx is the autocorrelation matrix shown above.
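The system Rxx H = Rxs of equation (3.14) can be solved numerically. Below is a
minimal sketch in Python (the project's own implementation was in MATLAB); it
assumes access to a clean reference s alongside the input x for estimating the
correlations, and the function names (`wiener_fir`, `fir_filter`) are illustrative,
not the project's code:

```python
def autocorr(x, lag):
    """Biased autocorrelation estimate Rxx(lag)."""
    n = len(x)
    return sum(x[i] * x[i - lag] for i in range(lag, n)) / n

def crosscorr(s, x, lag):
    """Biased cross-correlation estimate Rxs(lag) = E[s(n) x(n-lag)]."""
    n = len(x)
    return sum(s[i] * x[i - lag] for i in range(lag, n)) / n

def solve(A, b):
    """Solve A h = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    h = [0.0] * n
    for r in range(n - 1, -1, -1):
        h[r] = (M[r][n] - sum(M[r][k] * h[k] for k in range(r + 1, n))) / M[r][r]
    return h

def wiener_fir(x, s, N):
    """FIR Wiener filter coefficients from Rxx H = Rxs (eq. 3.14)."""
    rxx = [autocorr(x, m) for m in range(N)]
    Rxx = [[rxx[abs(i - j)] for j in range(N)] for i in range(N)]
    rxs = [crosscorr(s, x, j) for j in range(N)]
    return solve(Rxx, rxs)

def fir_filter(h, x):
    """Apply y(n) = sum_m h(m) x(n-m), as in eq. (3.2)."""
    return [sum(h[m] * x[i - m] for m in range(len(h)) if i >= m)
            for i in range(len(x))]
```

In practice Rxs cannot be measured from the noisy signal alone; in the
single-microphone case it is estimated during speech pauses with the help of a
voice activity detector, as in Figure 4.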
In the design process of the Wiener filter we thus seek, under the minimum mean
square error criterion, the unit impulse response or transfer function expression of
the filter. Its essence is a Wiener-Hopf equation. In addition, the Wiener filter
design requires the correlation functions of the signal and the noise.
3.4 The Wiener-Hopf Technique
In 1930, Eberhard Hopf joined the Department of Mathematics at the Massachusetts
Institute of Technology on a temporary contract with the help of Norbert Wiener. The
collaboration between Wiener and Hopf was initiated by their mutual interest in the
differential equations governing the radiation equilibrium of stars [9]. In Wiener's
own words [10], “The various types of particle which form light and matter exist in
a sort of balance with one another, which changes abruptly when we pass beyond the
surface of the star. It is easy to set up the equations for this equilibrium, but it
is not easy to find a general method for the solution of these equations.” Their
collaboration in research resulted in the Wiener-Hopf technique as a means to solve:

∫₀^∞ k(x − y) f(y) dy = g(x),  0 ≤ x < ∞ ……………………………….(3.15)

The method proceeds by extending the domain of, or continuing, the integral equation
(3.15) to negative real values of x. So,

∫₀^∞ k(x − y) f(y) dy = g(x) for 0 ≤ x < ∞, and = h(x) for −∞ < x < 0 …………… (3.16)

where h(x) is unknown. The actual solution and full details can be found in the
textbook by Noble [11], but the Fourier transformation of (3.15) then yields the
typical Wiener-Hopf functional equation,

G(α) + H(α) = F(α) K(α) ………………………….. (3.17)

in which H(α) and F(α) are half-range Fourier transforms of the unknown functions
h(x) and f(x) respectively. By contrast, G(α) and K(α) are half-range Fourier
transforms of the known functions g(x) and k(x).
3.5 Wiener Filter Solution
We can see from equation (3.17) that the key of the program lies in the known input
signal and its autocorrelation function. By solving the Wiener-Hopf equation with
the input-output cross-correlation, we obtain the Wiener filter.
Basic solution steps:
1. Initialize the values:
   a. a(0) = rxd(0) / rxx(0)
   b. b(0) = rxd(1) / rxx(0)
2. For j = 1, 2, …, M−1, make the following calculations:
   a. temp1 = [rxd(j) − Σ_{i=0}^{j−1} rxx(j−i) a(i)] / [rxx(0) − Σ_{i=0}^{j−1} rxx(j−i) b(i)]
   b. a(i) = a(i) − temp2 · b(i),  i = 0, 1, …, j−1
   c. a(j) = temp1
   d. temp2 = [rxx(j+1) − Σ_{i=0}^{j−1} rxx(i+1) b(i)] / [rxx(0) − Σ_{i=0}^{j−1} rxx(j−i) b(i)]
   e. b(i) = b(i−1) − temp2 · b(j−i),  i = 1, …, j
   f. b(0) = temp2
3. The filter response is:
   h(j) = [rxd(M) − Σ_{i=0}^{m} rxx(m−i) a(i)] / [rxx(1) − Σ_{i=0}^{m} rxx(m−i) b(i)]
4. Using the above calculation, we can get the Wiener filter output signal for the
   applied input signal.
3.6 Simulation Results
Figure 5: Simulation Result of Wiener Filter
From Figure 5, we can see that the applied signal has low amplitude and a high noise
level. On the other hand, the Wiener filter output signal is strong compared to the
noise; this is what increases the signal-to-noise ratio of the applied signal.
Finally, the Wiener filter output played through a speaker sounds clearer than the
original noisy signal.
Chapter 4
DYNAMIC TIME WARPING
4.1 Overview
In this type of speech recognition technique, the reference data is converted to
templates. The recognition process then consists of matching the incoming speech
with the stored templates. The template with the lowest distance measure from the
input pattern is the recognized word. The best match (lowest distance measure) is
found using dynamic programming. This is called a Dynamic Time Warping (DTW) word
recognizer.
In order to understand DTW, two concepts need to be dealt with:
• Features: the information in each signal has to be represented in some manner.
• Distances: some form of metric has to be used in order to obtain a match path.
  There are two types:
  o Local: a computational difference between a feature of one signal and a feature
    of the other.
  o Global: the overall computational difference between an entire signal and
    another signal of possibly different length.
Since the feature vectors could possibly have multiple elements, a means of
calculating the local distance is required. The distance measure between two feature
vectors is calculated using the Euclidean distance metric. Therefore, the local
distance between feature vector x of signal 1 and feature vector y of signal 2 is
given by

d(x, y) = √( Σ_i (x_i − y_i)² ) …………………………… (4.1)
4.2 DTW Algorithm
Speech is a time-dependent process. Hence, the utterances of the same word will
have different durations, and utterances of the same word with the same duration will
differ in the middle, due to different parts of the words being spoken at different rates. To
obtain a global distance between two speech patterns (represented as a sequence of
vectors) a time alignment must be performed.
This problem is illustrated in Figure 6, in which a ``time-time'' matrix is used to
visualize the alignment. As with all the time alignment examples, the reference
pattern (template) goes up the side and the input pattern goes along the bottom. In
this illustration the input ``SsPEEhH'' is a `noisy' version of the template
``SPEECH''. The idea is that `h'
is a closer match to `H' compared with anything else in the template. The input
``SsPEEhH'' will be matched against all templates in the system's repository. The best
matching template is the one for which there is the lowest distance path aligning the input
pattern to the template. A simple global distance score for a path is simply the sum of
local distances that go to make up the path.
Figure 6: DTW Algorithm to Search Path
To make the algorithm efficient and to reduce excessive computation, we apply
certain restrictions on the direction of propagation. The constraints are given
below:
• Matching paths cannot go backwards in time.
• Every frame in the input must be used in a matching path.
• Local distance scores are combined by adding to give a global distance.
This algorithm is known as Dynamic Programming (DP). When applied to
template-based speech recognition, it is often referred to as Dynamic Time Warping
(DTW). DP is guaranteed to find the lowest distance path through the matrix, while
minimizing the amount of computation. The DP algorithm operates in a
time-synchronous manner: each column of the time-time matrix is considered in
succession (equivalent to processing the input frame-by-frame) so that, for a
template of length N, the maximum number of paths being considered at any time is N.
If D(i,j) is the global distance up to (i,j) and the local distance at (i,j) is
given by d(i,j), then

D(i, j) = min[ D(i−1, j−1), D(i−1, j), D(i, j−1) ] + d(i, j) ….……… (4.2)
Given that D(1,1) = d(1,1) (this is the initial condition), we have the basis for an
efficient recursive algorithm for computing D(i,j). The final global distance D(n,N) gives
us the overall matching score of the template with the input. The input word is then
recognized as the word corresponding to the template with the lowest matching score.
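The recursion in equation (4.2), with the Euclidean local distance of equation
(4.1), can be sketched as follows (an illustrative Python sketch; the project's
implementation was in MATLAB, and the function names are hypothetical):

```python
import math

def local_distance(x, y):
    """Euclidean local distance between two feature vectors (eq. 4.1)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def dtw(template, pattern):
    """Global DTW distance using D(i,j) = min(...) + d(i,j) (eq. 4.2)."""
    n, m = len(template), len(pattern)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0  # boundary value, so that D(1,1) = d(1,1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_distance(template[i - 1], pattern[j - 1])
            D[i][j] = min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1]) + d
    return D[n][m]
```

Running this against every stored template and taking the template with the lowest
returned global distance yields the recognized word.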
4.3 Mel-Frequency Cepstrum Coefficients
In sound processing, the mel-frequency cepstrum (MFC) is a representation of the
short-term power spectrum of a sound, based on a linear cosine transform of a log power
spectrum on a nonlinear mel scale of frequency.
Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively
make up an MFC. They are derived from a type of cepstral representation of the audio
clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and
the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced
on the mel scale, which approximates the human auditory system's response more closely
than the linearly-spaced frequency bands used in the normal cepstrum. This frequency
warping can allow for better representation of sound, for example, in audio compression.
MFCCs are commonly derived as follows: [12]
1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using
triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a
signal.
5. The MFCCs are the amplitudes of the resulting spectrum.
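The five steps above can be sketched directly (a slow, pure-Python illustration
using a direct DFT; the filterbank size, coefficient count, and bin mapping are
illustrative assumptions, not the project's exact MATLAB parameters):

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, rate, n_filters=20, n_ceps=12):
    n = len(frame)
    # Step 1: Hamming-window the frame and take the DFT power spectrum
    win = [frame[i] * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
           for i in range(n)]
    half = n // 2 + 1
    power = []
    for k in range(half):
        re = sum(win[i] * math.cos(2.0 * math.pi * k * i / n) for i in range(n))
        im = sum(win[i] * math.sin(2.0 * math.pi * k * i / n) for i in range(n))
        power.append((re * re + im * im) / n)
    # Step 2: triangular filters equally spaced on the mel scale
    lo, hi = hz_to_mel(0.0), hz_to_mel(rate / 2.0)
    mels = [lo + (hi - lo) * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n + 1) * mel_to_hz(m) / rate) for m in mels]
    # Step 3: log of the energy collected by each mel filter
    logs = []
    for f in range(1, n_filters + 1):
        e = 0.0
        for k in range(bins[f - 1], bins[f]):
            if bins[f] > bins[f - 1]:
                e += power[k] * (k - bins[f - 1]) / (bins[f] - bins[f - 1])
        for k in range(bins[f], bins[f + 1]):
            if bins[f + 1] > bins[f]:
                e += power[k] * (bins[f + 1] - k) / (bins[f + 1] - bins[f])
        logs.append(math.log(e + 1e-10))
    # Steps 4-5: DCT of the log energies; keep the first n_ceps amplitudes
    return [sum(logs[m] * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m in range(n_filters))
            for c in range(n_ceps)]
```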
4.4 Applications
Using the above algorithm, we generate templates for the training data set. This set
includes 5 utterances for each of the five command words: “Left", “Right", "Stop",
"Forward" and "Backward". After feature extraction is done on the test word, it is
matched against all the templates and the minimum distance is calculated for each.
The word is classified as one of the five depending on the global minimum distance.
In this project, I developed the algorithm for controlling the two motors of a toy
car.
4.5 Distance Matrix Coefficient Result
Scores1 =
1.0e+004 *
0.5741 1.8575 3.7164 2.7567 3.2732 2.4946 2.1208 1.8565 3.4530 1.7783
1.6183 0.6201 5.4711 4.0024 4.5169 3.8422 3.025 1.439 5.5660 3.4090
3.5879 5.2763 0.4147 1.4435 2.4924 3.1520 2.5371 3.5035 1.0029 2.9692
2.2657 3.6250 1.0511 0.5487 2.1223 2.2573 1.8405 2.5638 1.0303 2.0788
2.4052 3.9130 2.1418 1.8048 0.4234 2.9629 2.6261 2.5600 2.1139 2.2589
2.2896 3.2425 3.0996 2.7853 3.0529 0.5002 2.2229 3.2317 2.7873 1.9719
2.1692 3.8311 3.1997 2.6173 2.9306 2.4412 0.6207 2.9675 2.6489 1.5591
1.9428 2.2470 4.0326 2.6934 3.1362 3.6682 3.4982 0.7567 4.3060 2.6786
3.2084 4.7374 1.0319 1.4739 2.2992 2.9104 2.2509 3.8464 0.4946 2.5945
2.2231 4.4386 2.9113 2.5731 2.7994 2.2883 1.3779 3.6557 2.2412 0.3803
With this coefficient result, you can generate the sound that represents the
matching result of speech recognition. The particular result above shows the
coefficients generated for the “Left” command, which can be played back. In this
project, scores of this type for different people’s speech are collected into a
vector called Allscores. That vector’s index is applied to the KNN algorithm to
calculate the final output index.
Chapter 5
K-NEAREST NEIGHBOR ALGORITHM
5.1 Introduction
K-Nearest Neighbor (KNN) is one of those algorithms that are very simple to
understand but work incredibly well in practice. In addition, it is surprisingly
versatile, and its applications range from vision to proteins to computational
geometry to graphs. Most people learn the algorithm and do not use it much, which is
a pity, as a clever use of KNN can make things very simple. It may also surprise
many to know that KNN is one of the top 10 data mining algorithms.
KNN is a non-parametric, lazy learning algorithm. This is useful because, in the
real world, most practical data does not obey the typical theoretical assumptions
(e.g., Gaussian mixtures, linear separability, etc.). Non-parametric algorithms like
KNN come to the rescue here.
It is also a lazy algorithm, meaning that it does not use the training data points
for any generalization: there is no explicit training phase, or it is minimal, which
makes the training phase fast. The lack of generalization means that KNN keeps all
the training data; more exactly, all the training data is needed during the testing
phase (this is an exaggeration, but not far from the truth). Most lazy algorithms –
especially KNN – make decisions based on the entire training data set (at best, a
subset of it).
The dichotomy is obvious here – there is a nonexistent or minimal training phase but
a costly testing phase. The cost is in terms of both time and memory: more time may
be needed because, in the worst case, all data points take part in the decision, and
more memory is needed because all the training data must be stored.
5.2 Assumptions in KNN
KNN assumes that the data is in a feature space; more exactly, the data points are
in a metric space. The data can be scalars or multidimensional vectors. Since the
points are in a feature space, they have a notion of distance. This need not
necessarily be the Euclidean distance, although it is the one commonly used.
Figure 7: Euclidean Distances between Two Vectors Xr and Xs [13]
d(Xr, Xs) = ‖Xr − Xs‖ = √( (Xr1 − Xs1)² + (Xr2 − Xs2)² ) …………… (5.1)
Each training example consists of a vector and a class label associated with it. In
the simplest case, the label will be either + or − (for positive or negative
classes), but KNN can work equally well with an arbitrary number of classes.
We are also given a single number "k". This number decides how many neighbors
(where neighbors are defined by the distance metric) influence the classification.
It is usually an odd number when the number of classes is two. If k = 1, the algorithm
is simply called the nearest neighbor algorithm.
5.3 KNN for Density Estimation
Although classification remains the primary application of KNN, it can also be used
for density estimation. Since KNN is non-parametric, it can estimate
arbitrary distributions. The idea is very similar to the Parzen window. Instead of using
a fixed hypercube and kernel functions, the estimation proceeds as follows: to estimate the
density at a point x, place a hypercube centered at x and keep increasing its size until k
neighbors are captured. Then estimate the density using the formula,
p(x) = (k/n) / V …………………………………… (5.2)
where n is the total number of data points and V is the volume of the hypercube. Notice that the
numerator is essentially a constant and the volume influences the density. The intuition is
this: Let’s say density at x is very high. Now, we can find k points near x very quickly.
These points are also very close to x (by definition of high density). This means the
volume of hypercube is small and the resultant density is high. Let’s say the density
around x is very low. Then the volume of the hypercube needed to encompass k nearest
neighbors is large and consequently, the ratio is low.
The volume performs a job similar to the bandwidth parameter in kernel density
estimation. In fact, KNN is one of the common methods used to estimate the bandwidth (e.g. in
adaptive mean shift).
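The procedure above can be sketched in a few lines of Python for the one-dimensional case (an illustrative sketch, where the "hypercube" around x is simply the interval of radius r_k reaching the k-th nearest sample):

```python
def knn_density_1d(x, samples, k):
    # Eq. (5.2): p(x) = (k/n) / V, with V the volume (here, the length)
    # of the smallest interval centered at x that captures k samples.
    n = len(samples)
    r_k = sorted(abs(s - x) for s in samples)[k - 1]  # distance to k-th nearest sample
    return (k / n) / (2 * r_k)

samples = [0.0, 0.1, 0.2, 5.0]
print(knn_density_1d(0.1, samples, 3))  # dense region -> 3.75
print(knn_density_1d(5.0, samples, 3) < knn_density_1d(0.1, samples, 3))  # True
```

Note how the estimate is high where the k neighbors are captured by a small interval and low where a large interval is needed, matching the intuition above.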
5.4 KNN Classification
In this case, we are given some data points for training and a new unlabelled data
point for testing. Our aim is to find the class label for the new point. The algorithm
behaves differently depending on k.
Case 1: k = 1 or Nearest Neighbor Rule
This is the simplest scenario. Let x be the point to be labeled. Find the point
closest to x and call it y. The nearest neighbor rule then assigns the label of y to x. This
seems too simplistic and sometimes even counter-intuitive. If you feel that this procedure
could result in a huge error, you are right, but there is a catch: this reasoning holds only
when the number of data points is not very large.
If the number of data points is very large, there is a very high chance that the
labels of x and y are the same. An example might help: suppose you have a (potentially)
biased coin. You toss it a million times and get heads 900,000 times. Then, most
likely, your next call will be heads.
Now, assume all points are in a D-dimensional plane and that the number of points is
reasonably large. This means that the density of the plane at any point is high; in other
words, within any subspace there is an adequate number of points. Consider a point x in the
subspace, which also has many neighbors, and let y be its nearest neighbor. If x and y
are sufficiently close, we can assume that the probability of x belonging to a given
class is approximately the same as that of y; then, by decision theory, x and y have the same class.
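The nearest neighbor rule described above amounts to a single minimum-distance search. The following is a hedged Python sketch (the points and labels are made up for illustration):

```python
import math

def nn_classify(x, training_data):
    # 1-NN rule: label x with the class of its single closest neighbor.
    # training_data is a list of (point, label) pairs.
    _, label = min(training_data, key=lambda pl: math.dist(x, pl[0]))
    return label

train = [((0, 0), '-'), ((0, 1), '-'), ((5, 5), '+'), ((6, 5), '+')]
print(nn_classify((5.5, 4.0), train))  # -> '+'
```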
The book "Pattern Classification" by Duda and Hart has an excellent discussion
of this nearest neighbor rule. One of their striking results is a tight error
bound for the nearest neighbor rule:
P* ≤ P ≤ P*(2 − (c/(c−1))·P*) …………………………… (5.3)
where P* is the Bayes error rate, c is the number of classes, and P is the error rate
of nearest neighbor. The result is indeed very striking (at least to me) because it says
that if the number of points is large, then the error rate of nearest neighbor is less than
twice the Bayes error rate.
Case 2: k = K or k-Nearest Neighbor Rule
This is a straightforward extension of 1NN: we find the k nearest
neighbors and take a majority vote. Typically, k is odd when the number of
classes is 2. Say k = 5 and there are 3 instances of C1 and 2 instances of C2 among the neighbors; in this
case, KNN labels the new point as C1, since C1 forms the majority. A
similar argument applies when there are multiple classes.
One straightforward extension is to not give every neighbor an equal vote. A
very common approach is weighted KNN, where each point has a weight, which is
typically calculated from its distance. For example, under inverse distance weighting, each
point has a weight equal to the inverse of its distance to the point being classified, so
nearer points have a higher vote than farther ones.
It is obvious that the accuracy might increase as you increase k, but the
computational cost also increases.
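As a sketch of inverse distance weighting (illustrative Python with made-up data), note how the two nearby 'A' points can outvote three distant 'B' points, even though a plain majority vote would pick 'B':

```python
import math
from collections import defaultdict

def weighted_knn(x, training_data, k):
    # Each of the k nearest neighbors votes with weight 1/distance
    neighbors = sorted(training_data, key=lambda pl: math.dist(x, pl[0]))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = math.dist(x, point)
        votes[label] += 1.0 / d if d > 0 else float('inf')  # exact match dominates
    return max(votes, key=votes.get)

train = [((0, 0), 'A'), ((1, 0), 'A'), ((4, 0), 'B'), ((5, 0), 'B'), ((6, 0), 'B')]
print(weighted_knn((1.5, 0), train, k=5))  # -> 'A' (flat majority would say 'B')
```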
5.5 Implementation of KNN
The algorithm for computing the K nearest neighbors is as follows:
1. Determine the parameter K, the number of nearest neighbors, beforehand. The
choice of this value is up to the user.
2. Calculate the distance between the query instance and all the training samples.
Any distance metric can be used.
3. Sort the distances for all the training samples and determine the K nearest
neighbors, i.e. the samples with the K smallest distances.
4. Since this is supervised learning, gather the category labels of the training
samples that fall within those K.
5. Use the majority category among the nearest neighbors as the prediction.
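The five steps above can be sketched as follows (a Python illustration, assuming Euclidean distance and labeled (point, class) pairs; the project's actual implementation is the MATLAB code in Appendix A):

```python
import math
from collections import Counter

def knn_predict(query, training_data, k=3):
    # Steps 2-3: compute the distance to every training sample and sort
    by_distance = sorted(training_data, key=lambda pl: math.dist(query, pl[0]))
    # Step 4: collect the class labels of the k nearest samples
    top_k_labels = [label for _, label in by_distance[:k]]
    # Step 5: majority vote among the k nearest neighbors
    return Counter(top_k_labels).most_common(1)[0][0]

train = [((1, 1), 'cat'), ((1, 2), 'cat'), ((8, 8), 'dog'), ((9, 8), 'dog'), ((8, 9), 'dog')]
print(knn_predict((2, 2), train, k=3))  # -> 'cat'
```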
5.6 Results
After implementing this algorithm, I completed the whole speech recognition
software. When tested with my own voice, it worked with 100% accuracy; when tested
with my roommate's and friends' voices, it succeeded around 85% of the time.
Therefore, the average success rate of this algorithm is around 92%.
Chapter 6
HARDWARE IMPLEMENTATION
6.1 Overview
The first goal of this project was to create a circuit implementing a
voice-controlled robot car that utilized stand-alone hardware to perform the
recognition functions instead of the more commonly used software. The HM2007 IC and all
the other components comprising this circuit were assembled and wired on a
designed PCB. Table 2 below shows the parts list used in creating this circuit.
Components           Value           Quantity
HM 2007 SR-7 kit     N/A             1
AT89c52              N/A             1
ULN 2003             N/A             1
78LS05               N/A             2
Capacitors           0.1 uF          1
Capacitors           100 uF          1
Capacitors           0.22 uF         2
Resistors            220 ohms        2
Resistors            6.8 Kohms       1
Resistors            330 ohms        3
Oscillator (XTAL)    1.5 MHz         1
Relays               12 V            4
Motors               12 V, 200 RPM   2
Table 2: Component List
All of the above components were purchased from Images Scientific Instruments, Inc.
The circuit of SR-7 kit is shown in Figure 8 below,
Figure 8: Circuit Schematic of SR – 7 [14]
The microphone and the keypad are the only user interfaces to the
circuit. The microphone is a standard PC microphone, acting as the transducer
that converts pressure waves to an electrical signal. It is coupled to the
HM2007 IC, which attempts to classify each word into the trained
categories. The keypad consists of 12 normally open momentary contact switches,
soldered onto a printed circuit board (PCB) used to communicate with
the HM2007 IC. The keypad allows the user to train the system and clear the memory.
The circuit outputs consist of two 7-segment displays and an LED. The 7-segment
displays show any error codes, the target being trained, and the final classification
by the HM2007 system. As designed in the circuit, the top display is the most significant
digit and the bottom the least significant. For example, the number 9 shows a 0 on the
top display and a 9 on the bottom display. Only 01 through 09 were used for this project.
The LED is connected to the HM2007 IC and shows its status:
when the LED is on, the system is listening and will classify all incoming sounds;
when the LED is off, the system has been placed in training mode; and when the LED
flashes, it indicates that the word just spoken was successfully trained and placed into
memory.
When the circuit is turned on, the HM2007 checks the static RAM. If everything
checks out, the board displays "00" on the digital display and lights the red LED
(READY). It is then in the "Ready" state, waiting for a command.
6.2 Block Diagram of Hardware
[Figure 9 shows the hardware signal flow: Voice Input (Microphone) → HM 2007 SR-07 Kit
(speech processing logic) → Microcontroller AT89C52 (generating relay outputs) →
Darlington Array ULN 2003 chip (increasing driving strength) → 4x4 Relay Network
(driving motors) → Motor-1 and Motor-2.]
Figure 9: Block Diagram of Hardware
6.3 Training and Working of the HM-2007
To train the circuit, begin by pressing on the keypad the number of the word you want to
train. The circuit can be trained to recognize up to 40 words; use any number
between 1 and 40. For example, press "1" to train word number 1. When you
press the number(s) on the keypad, the red LED turns off and the number is shown on the
digital display. Next, press the "#" key to train. Pressing the "#" key
signals the chip to listen for a training word, and the red LED turns back on. Now speak the
word you want the circuit to recognize clearly into the microphone. The LED should
blink off briefly; this is a signal that the word has been accepted.
Continue training new words using the procedure outlined above: press the
"2" key and then the "#" key to train the second word, and so on. The circuit will accept up to
forty words, but you do not have to enter all 40 into memory to use the circuit; you can
use as many word spaces as you want.
The circuit is continually listening. Repeat a trained word into the microphone
and the number of that word will be shown on the digital display. For instance, if the
word "directory" was trained as word number 25, saying "directory" into the
microphone will cause the number 25 to be displayed.
The chip provides the following error codes:
a. 55 = word too long
b. 66 = word too short
c. 77 = word no match
6.4 Microcontroller Interfacing with Motors
The output of the HM-2007 kit goes to the microcontroller's port 1 inputs. The
microcontroller decides which action to perform and generates the corresponding output on port
P3; thus port P3 carries the microcontroller output that serves as the
input for motor driving. This output, however, cannot drive the motors directly. A
driver circuit is needed to supply the higher DC voltage required by the 12 V, 200 rpm
motors. Therefore, the microcontroller output drives a ULN 2003 IC, which is built
from seven Darlington arrays.
A Darlington transistor (often called a Darlington pair) is a compound
structure consisting of two bipolar transistors (either integrated or separate devices)
connected in such a way that the current amplified by the first transistor is amplified
further by the second one. The overall gain of the Darlington pair is βdarlington = β1 · β2.
Figure 10: Simple Darlington Transistor
This kind of Darlington transistor is useful for driving the relay network that
controls the motors. Figure 11 shows the circuit diagram of the microcontroller
interfacing circuit.
Figure 11: Microcontroller Interfacing Circuit
If the above circuit is combined with the HM2007 SR-07 kit, two motors can be controlled
by voice commands. In this particular project, I trained the
SR-07 kit for five commands: Left, Right, Forward, Backward and Stop. The kit generates a
BCD output that goes to the microcontroller, which decides what output to generate for the relays
that drive the motors. The whole project was built on a small metal box, making it a small toy
car. Figure 12 shows pictures of the project.
Figure 12: Pictures of Whole Project
6.5 Limitations of the Hardware
According to the testing results, this hardware works only for the one person
who trained the kit. In addition, it does not work in a noisy environment, and sometimes
echoed speech is not recognized by the kit either. Because of these errors, the motors
cannot always be controlled at the proper instant. These limitations can be reduced by
developing a more efficient algorithm, such as the nearest neighbor algorithm. As
discussed earlier in this project, applying the output of the MATLAB code to the
microcontroller interfacing circuit might make the system work more efficiently.
Chapter 7
CONCLUSION
This project shows that the KNN algorithm applied after DTW is an efficient approach
to speech recognition. In addition, the Wiener filter can increase the signal-to-noise ratio of
the applied signal, and this filtered signal enhances the efficiency of the speech recognition.
Many major companies use this kind of technique in their customer service
systems. The hardware part of this project shows that plain DTW can be used for
smaller tasks, and the project can be enhanced with a better filter and a larger algorithm
database. Moreover, as future work, the source code of the software design could be loaded
onto a higher-end microcontroller such as the ATMEGA-32, whose output could drive the
microcontroller interface more efficiently.
APPENDIX A
Software Code
1) Recognition_Main:
clc;
nc = 13;          %Required number of MFCC coefficients
N = 5;            %Number of words in vocabulary
k = 3;            %Number of nearest neighbors to choose
fs = 16000;       %Sampling rate
duration1 = 0.15; %Initial silence duration in seconds
duration2 = 2;    %Recording duration in seconds
G = 2;            %Vary this factor to compensate for amplitude variations
NSpeakers = 5;    %Number of training speakers
fprintf('Press any key to start %g seconds of speech recording...', duration2);
pause;
silence = wavrecord(duration1*fs, fs);
fprintf('Recording speech...');
speechIn = wavrecord(duration2*fs, fs);  %duration2*fs is the total number of sample points
fprintf('Finished recording.\n');
fprintf('System is trying to recognize what you have spoken...\n');
speechIn1 = [silence;speechIn];          %Pads with 150 ms of silence
speechIn2 = speechIn1.*G;
speechIn3 = speechIn2 - mean(speechIn2); %DC offset elimination
speechIn = WienerFilter(speechIn3);      %Applies spectral subtraction
rMatrix = mfcc(nc,speechIn,fs);          %Computes the test feature vector
Sco = DTWCalc(rMatrix,N);                %Computes all DTW scores
[SortedScores,EIndex] = sort(Sco);       %Sorts scores in increasing order
K_Vector = EIndex(1:k);                  %Gets indices of the k lowest scores
Neighbors = zeros(1,k);                  %Will hold the k nearest neighbors
%The code below uses the indices of the k lowest scores to determine their classes
for t = 1:k
    u = K_Vector(t);
    for r = 1:NSpeakers-1
        if u <= N
            break
        else
            u = u - N;
        end
    end
    Neighbors(t) = u;
end
%Apply k-Nearest Neighbor rule
Nbr = Neighbors;
%sortk = sort(Nbr);
[Modal,Freq] = mode(Nbr);  %Most frequent value
Word = strvcat('Stop','Left','Right','Forward','Backward');
if mean(abs(speechIn)) < 0.01
fprintf('No microphone connected or you have not said anything.\n');
elseif ((k/Freq) > 2)  %If no majority
fprintf('The word you have said could not be properly recognised.\n');
else
fprintf('You have just said %s.\n',Word(Modal,:)); %Prints recognized word
end
2) Create Reference File:
clc;
nc = 13;  %Required number of MFCC coefficients
Ref1 = cell(1,5);
Ref2 = cell(1,5);
Ref3 = cell(1,5);
Ref4 = cell(1,5);
Ref5 = cell(1,5);
for j = 1:5
q = ['\SpeechData\1\5_' num2str(j) '.wav'];
[speechIn1,FS1] = wavread(q);
%disp(FS1);
speechIn1 = WienerFilter(speechIn1);
Ref1(1,j) = {mfcc(nc,speechIn1,FS1)}; %MFCC coefficients are
%computed here
end
for k = 1:5
q = ['\SpeechData\2\5_' num2str(k) '.wav'];
[speechIn2,FS2] = wavread(q);
%disp(FS2);
speechIn2 = WienerFilter(speechIn2);
Ref2(1,k) = {mfcc(nc,speechIn2,FS2)};
end
for l = 1:5
q = ['\SpeechData\3\5_' num2str(l) '.wav'];
[speechIn3,FS3] = wavread(q);
%disp(FS3);
speechIn3 = WienerFilter(speechIn3);
Ref3(1,l) = {mfcc(nc,speechIn3,FS3)};
end
for m = 1:5
q = ['\SpeechData\4\5_' num2str(m) '.wav'];
[speechIn4,FS4] = wavread(q);
%disp(FS4);
speechIn4 = WienerFilter(speechIn4);
Ref4(1,m) = {mfcc(nc,speechIn4,FS4)};
end
for n = 1:5
q = ['\SpeechData\5\5_' num2str(n) '.wav'];
[speechIn5,FS5] = wavread(q);
%disp(FS5);
speechIn5 = WienerFilter(speechIn5);
Ref5(1,n) = {mfcc(nc,speechIn5,FS5)};
end
%Converts the cells containing all matrices to structures and save
%structures in matlab .mat files in the working directory.
labels = {'Stop','Left','Right','Forward','Backward'};
s1 = cell2struct(Ref1, labels, 2);
save Vectors1.mat -struct s1;
s2 = cell2struct(Ref2, labels, 2);
save Vectors2.mat -struct s2;
s3 = cell2struct(Ref3, labels, 2);
save Vectors3.mat -struct s3;
s4 = cell2struct(Ref4, labels, 2);
save Vectors4.mat -struct s4;
s5 = cell2struct(Ref5, labels, 2);
save Vectors5.mat -struct s5;
3) Function Wiener Filter:
function Speechout=WienerFilter(Speechin)
d = Speechin; d = d*8; d = d'; %Enhance the speech signal strength so we can generate a noise signal with the proper number of elements
%fq=fft(d,8192); % Discreat Fourier Transform
%t = 0:1:36062;
%subplot(3,1,1);
%f=Fs*(0:4095)/8192; %Fixed Frequency Spectrum
%plot(f,(y(1:4096)));
%plot(t,y);
%title ('Original voice signal in frequency domain graphics');
%xlabel ('frequency F');
%ylabel ('FFT');
%[m,n]=size(d);
%x_noise=randn(1,n);
%x=d+x_noise; %After adding noise to the speech signal; the noise is (0,1)-distributed Gaussian white noise
x = d;
%fq=fft(x,8192);
%subplot(3,1,2);
%plot(f,(x(1:4096)));
%plot(t,x);
%title ('Speech Signal with Noise in frequency domain graphics');
%xlabel ('frequency F');
%ylabel('FFT');
Rxxcorr=xcorr(x(1:4096));
size(Rxxcorr);
A=Rxxcorr(4096:4595); %Rxx of the wienerfilter
Rxdcorr=xcorr(d(1:4096),x(1:4096));
size(Rxdcorr);
B=Rxdcorr(4096:4595); %Rxd of the wienerfilter
M=500; %Length of the Filter
Speechout=wienerfilter1(x,A,B,M); %Denoising using Wiener filtering
%Result = Result/6; %Enhanced Result Signal
%fq=fft(Result);
%subplot(3,1,3);
%f=Fs*(0:4095)/8192;
%plot(f,(Result(1:4096)));
%plot(t,Result);
%title ('Wiener filtering voice signal in frequency domain');
%xlabel ('frequency F');
%ylabel('FFT');
end
function y=wienerfilter1(x,Rxx,Rxd,N)
%Wiener filtering
%x is the input signal; Rxx is the input signal's autocorrelation vector
%Rxd is the cross-correlation vector between the input signal and the ideal signal
%N is the length of the Wiener filter
%Output y is the input signal after Wiener filtering
h=wienersolution(Rxx,Rxd,N);%solution of Wiener filter coefficient
t=conv(x,h); %Filter
Lh=length(h); %Get length of the filter
Lx=length(x); %Get the length of the input signal
y=t(double(uint16(Lh/2)):Lx+double(uint16(Lh/2))-1); %Output sequence y has the same length as the input sequence x
%y = t;
end
4) Wiener_Solution:
function h = wienersolution(A,B,M)
%Solution of the Wiener-Hopf equation
%A (Rxx) is the autocorrelation vector of the received signal: Rxx(0), Rxx(1), ..., Rxx(M-1)
%B (Rxd) is the cross-correlation vector between the received signal and the noise-free signal: Rxd(0), Rxd(1), ..., Rxd(M-1)
%M is the length of the filter
%h holds the filter coefficients
%For example, A=[6,5,4,3,2,1]; B=[100,90,120,50,80,200]; M=6;
%Solution: h=[26.4286 -20.0000 50.0000 -50.0000 -45.0000 81.4286]'
T1=zeros(1,M);%T1 storage of intermediate vector equation solution
T2=zeros(1,M);%T2 storage of intermediate vector equation solution
T1(1)=B(1)/A(1);
T2(1)=B(2)/A(1);
X=zeros(1,M);
for i=2:M-1
    temp1=0;
    temp2=0;
    for j=1:i-1
        temp1=temp1+A(i-j+1)*T1(j);
        temp2=temp2+A(i-j+1)*T2(j);
    end
    X(i)=(B(i)-temp1)/(A(1)-temp2);
    for j=1:i-1
        X(j)=T1(j)-X(i)*T2(j);
    end
    for j=1:i
        T1(j)=X(j);
    end
    temp1=0;
    temp2=0;
    for j=1:i-1
        temp1=temp1+A(j+1)*T2(j);
        temp2=temp2+A(j+1)*T2(i-j);
    end
    X(1)=(A(i+1)-temp1)/(A(1)-temp2);
    for j=2:i
        X(j)=T2(j-1)-X(1)*T2(i-j+1);
    end
    for j=1:i
        T2(j)=X(j);
    end
end
temp1=0;
temp2=0;
for j=1:M-1
    temp1=temp1+A(M-j+1)*T1(j);
    temp2=temp2+A(M-j+1)*T2(j);
end
X(M)=(B(M)-temp1)/(A(1)-temp2);
for j=1:M-1
    X(j)=T1(j)-X(M)*T2(j);
end
h=X;
end
5) Mel-Frequency Cepstral Coefficients:
function REF=mfcc(num,s,Fs)
n = 512;           %Number of FFT points
Tf = 0.025;        %Frame duration in seconds
N = Fs*Tf;         %Number of samples per frame
fn = 24;           %Number of mel filters
l = length(s);     %Total number of samples in the speech signal
Ts = 0.01;         %Frame step in seconds
FrameStep = Fs*Ts; %Frame step in samples
a = 1;
b = [1, -0.97];    %a and b are high-pass filter coefficients
noFrames = floor(l/FrameStep); %Maximum number of frames in the speech sample
REF = zeros(noFrames-2, num);  %Matrix to hold cepstral coefficients
lifter = 1:num;                %Lifter vector index
lifter = 1+floor(num/2)*(sin(lifter*pi/num)); %Raised-sine lifter version
if mean(abs(s)) > 0.01
    s = s/max(s);  %Normalizes to compensate for microphone volume differences
end
%Segment the signal into overlapping frames and compute MFCC coefficients
for i=1:noFrames-2
    frame = s((i-1)*FrameStep+1:(i-1)*FrameStep+N); %Holds individual frames
    Ce1 = sum(frame.^2);        %Frame energy
    Ce2 = max(Ce1,2e-22);       %Floors the energy to 2e-22
    Ce = log(Ce2);
    framef = filter(b,a,frame); %High-pass pre-emphasis filter
    %v = fix(N);
    v = hamming(N);
    v = v';
    F = framef.*v;              %Multiplies each frame by the Hamming window
    FFTo = fft(F,N);            %Computes the FFT
    FFTo = FFTo';
    melf = melbankm(fn,n,Fs);   %Creates the 24-filter mel filter bank
    halfn = 1+floor(n/2);
    spectr1 = log10(melf*abs(FFTo(1:halfn)).^2); %Result is mel-scale filtered
    spectr = max(spectr1(:),1e-22);
    c = dct(spectr);            %Obtains the DCT, changing to the cepstral domain
    c(1) = Ce;                  %Replaces the first coefficient
    coeffs = c(1:num);          %Retains the first num coefficients
    ncoeffs = coeffs.*lifter';  %Multiplies coefficients by the lifter values
    REF(i, :) = ncoeffs';       %Assigns MFCC coefficients to successive rows
end
%Call the dcoeff function to compute derivatives of the MFCC
%coefficients; concatenate all together to yield a matrix with 3*num columns
d = (dcoeff(REF)).*0.6;  %Computes delta-MFCC
d1 = (dcoeff(d)).*0.4;   %As above, for delta-delta-MFCC
REF = [REF,d,d1];        %Concatenates all together
6) DTW Algorithm Calculation:
function AllScores = DTWCalc(rMatrix,N)
%Vectors to hold DTW scores
Scores1 = zeros(1,N);
Scores2 = zeros(1,N);
Scores3 = zeros(1,N);
Scores4 = zeros(1,N);
Scores5 = zeros(1,N);
%Load the reference templates from file
s1 = load('Vectors1.mat');
REFall1 = struct2cell(s1);
s2 = load('Vectors2.mat');
REFall2 = struct2cell(s2);
s3 = load('Vectors3.mat');
REFall3 = struct2cell(s3);
s4 = load('Vectors4.mat');
REFall4 = struct2cell(s4);
s5 = load('Vectors5.mat');
REFall5 = struct2cell(s5);
%Compute DTW scores for test template against all reference templates
for i = 1:N
REF1 = REFall1{i,1};
Scores1(i) = DTW(REF1,rMatrix);
end
for j = 1:N
REF2 = REFall2{j,1};
Scores2(j) = DTW(REF2,rMatrix);
end
for m = 1:N
REF3 = REFall3{m,1};
Scores3(m) = DTW(REF3,rMatrix);
end
for p = 1:N
REF4 = REFall4{p,1};
Scores4(p) = DTW(REF4,rMatrix);
end
for q = 1:N
REF5 = REFall5{q,1};
Scores5(q) = DTW(REF5,rMatrix);
end
AllScores = [Scores1,Scores2,Scores3,Scores4,Scores5];
function [cost] = DTW(featureMatrix,RefMatrix)
F = featureMatrix;
R = RefMatrix;
[r1,c1]=size(F);  %Test matrix dimensions
[r2,c2]=size(R);  %Reference matrix dimensions
localDistance = zeros(r1,r2);%Matrix to hold local distance values
%The local distance matrix is derived below
for n=1:r1
for m=1:r2
FR=F(n,:)-R(m,:);
FR=FR.^2;
localDistance(n,m)=sqrt(sum(FR));
end
end
D = zeros(r1+1,r2+1);  %Global distance matrix, initialized to zeros
D(1,:) = inf;          %Pads the top row with infinite values
D(:,1) = inf;          %Pads the left column with infinite values
D(1,1) = 0;
D(2:(r1+1), 2:(r2+1)) = localDistance;
%This loop iterates through distance matrix to obtain global
%minimum distance
for i = 1:r1
    for j = 1:r2
        dmin = min([D(i,j), D(i,j+1), D(i+1,j)]);
        D(i+1,j+1) = D(i+1,j+1) + dmin;
    end
end
cost = D(r1+1,r2+1);
%returns overall global minimum score
7) Delta Coefficient:
function diff = dcoeff(x)
[nr,nc] = size(x);
K = 3;       %Number of frames to span (backward and forward spans are equal)
b = K:-1:-K; %Vector of filter coefficients
%Pads the cepstral coefficient matrix by repeating the first and last rows K times
px = [repmat(x(1,:),K,1); x; repmat(x(end,:),K,1)];
diff = filter(b, 1, px, [], 1); %Filters the data vector along each column
diff = diff/sum(b.^2);          %Divides by the sum of squares of all span values
%Trim off the upper and lower K rows to make the input and output matrices equal
diff = diff(K + (1:nr), :);
APPENDIX B
Hardware Code
1) Code For Microcontroller:
#include<reg51.h>
#define voice P2
sbit RELAY1=P1^0;
sbit RELAY2=P1^1;
sbit RELAY3=P1^2;
sbit RELAY4=P1^3;
void main()
{
P1=0X00;
while(1)
{
if(voice==0XF1)       //Generating Relay Combination for Left
{
RELAY1=1;
RELAY2=0;
RELAY3=1;
RELAY4=0;
}
else if(voice==0XF2) //Generating Relay Combination for Right
{
RELAY1=0;
RELAY2=1;
RELAY3=0;
RELAY4=1;
}
else if(voice==0XF3) //Generating Relay Combination for Forward
{
RELAY1=1;
RELAY2=0;
RELAY3=0;
RELAY4=1;
}
else if(voice==0XF4) //Generating Relay Combination for Backward
{
RELAY1=0;
RELAY2=1;
RELAY3=1;
RELAY4=0;
}
else if(voice==0XF5)
{
RELAY3=1;
}
else if(voice==0XF6)
{
RELAY3=0;
}
else if(voice==0XF7)
{
RELAY4=1;
}
else if(voice==0XF8)
{
RELAY4=0;
}
}
}
BIBLIOGRAPHY
[1] Kerstin Dautenhahn (2005), “Social Intelligence and Interaction in Animal Robots”,
SSAISB Convention, UK pp 1-3
[2] Yi Hu, P. C. Loizou (2006), “Subjective comparison of speech enhancement
Algorithms”, IEEE 2006 International conference, USA pp 1-5
[3] Wiener N (1949), “Extrapolation Interpolation and Smoothing of stationary Time
Series”, Wiley, USA pp 10-15
[4] Kolmogorov A (1941), “Interpolation and Extrapolation of stationary random
sequences”, Mathematic series, USA pp 3-14
[5] Barrett J, Moir T (1987), “A Unified Approach to Multivariable, Discrete Time
filtering based on the Wiener Theory”, Kyberbetika 23 pp 177-197
[6] Brown R, Hwang P (1996), “Introduction to Random Signals and Applied Kalman
Filtering”, 3rd John Wiley & Sons, New-York
[7] Z. Qi and T. Moir (2007), “An Adaptive Wiener Filter for an Automotive Application
with Non-Stationary Noise”, 2nd International Conference on Sensing Technology, USA
pp 300-305
[8] Haykin S (2002), “Adaptive Filter Theory”, 4ed Prentice Hall, Englewood Cliffs
[9] Jane B and David A (2007), “A Brief Historical Perspective of the Wiener-Hopf
Techniques”, University of Manchester Springer, USA pp 351-356
[10] Wiener N (1956), “I am a Mathematician”, Doubleday & co. Inc, Garden city,USA
[11] Jones DS (1952), “A Simplifying Technique in the Solution of a Class of Diffraction
Problems”, Quart J Math 3 pp 189-196
[12] R. Plomp, L. C. Pols and J. P. Van de Geer (1967), “Dimensional Analysis of Vowel
Spectra”, J Acoustical Society of America pp 707-712
[13] Thomas B (2008), “K-Nearest Neighbors Algorithm, Southern Methodist
University”, Texas pp 1-5
[14] SR-06/SR-07 Speech Recognition Kit, Images SI Inc., Staten Island, NY,
http://www.imagesco.com/speech/speech-recognition-index.html