University of Victoria
Faculty of Engineering
Spring 2007 Elec499B Final Report
Speech Activated Appliances
Group Members
David Beckstrom
Harun Ali
Kunal Jadav
Nicole Su Lee, Ng
Zorawar Bhatia
In partial fulfillment of the requirements of the
B.Eng. Degree
Table of Contents
1.0 Introduction
   Concept
   Goal of Project
   Realization
   Demonstration System Operation
   Software Concept
      Feature Extraction
      Database Creation
      Pattern Recognition Algorithm
   Output Interface and Hardware Interface Concept
2.0 Background
   Speech Recognition
   Requirements for Speech Recognition
   Information Contained in Database
      Types of Algorithms
   Other Concepts Used in This Report
      Windowing
      Modelling of Speech
3.0 Database
   Database Design
   Database Construction
      Feature Extraction
      Feature Matrix and Database Formation
   Database Comparison
4.0 Dynamic Time Warping (DTW)
   Distance Between Two Sequences of Vectors
      Comparing the Distance Between Two Sequences of Vectors of Different Length
   Finding the Optimal Path
      Local Distances
5.0 Experiments & Results
6.0 Hardware
7.0 Conclusion
8.0 References
1.0 Introduction
Concept:
This project will demonstrate the next generation of home automation technology:
speech activated appliances. Imagine being able to control the lights in a room,
or the temperature of your home, with a simple spoken command. Imagine the
security of having your home recognize and respond to your voice alone. This
project will explore these ideas by developing voice recognition software, and
then demonstrating that software through a basic implementation of a voice
recognition system on readily available electronic components and hardware.
The system will respond to a list of defined spoken commands and control two
basic household appliances: a lamp, and a seven-segment LED display mimicking a
thermostat.
Goal of Project:
A full realization of this concept would involve a few distinct steps. First, develop
a database of commands the system will respond to. Second, develop voice
recognition software that can perform a comparison between a command issued
to the system and the database of commands. Third, develop a sufficient
hardware system to translate a matched command into a control signal, and
finally into a realized change of state in hardware. Fourth, develop the above
system to exist on a programmable DSP chip such that it operates independently
of an external computing source, and interacts with its hardware inputs and
outputs independently. Such a system would be integrated into the user's home,
use microphones installed in the home as input sources, and would issue control
signals to hardware already installed in the home.
Realization:
As a full realization of this concept is beyond the time and budgetary constraints
of this project, we plan instead to prove the concept by designing a
demonstration circuit that will operate as a scaled-down version of the above
system. The aim of this project, then, is to prove the concept of speech activated
appliances by developing a voice recognition system to recognize 5 user-spoken
commands in real time and issue control signals to a pair of recognizable
household appliances.
• A standard PC microphone will be used as the input source
• The voice recognition software will be written in MATLAB, to be run on a desktop PC
• A simple hardware interface will be developed to translate control signals into a change of state in the appliances
• The appliances we have chosen to demonstrate the control aspect of the project are a light and a thermostat.
  o The thermostat will be simulated with two seven-segment LED displays showing the current temperature "set-point" of the demonstration thermostat
  o The lamp will be realized with a standard 120 V desk lamp.
Demonstration System Operation:
A fully functional system will operate as follows. An individual will approach the
microphone and issue one of 5 pre-recorded commands: "On" or "Dark" to
control the lamp, or "Fifteen", "Twenty", or "Twenty-Five" to change the set-point
of the thermostat. The analog signal of this command will then be converted to a
digital signal. After A/D conversion, software will process the signal and store it
in memory. The stored information will then be compared to the information
stored in a database of pre-recorded commands via a speech recognition
algorithm. When a match is made, a control signal will be issued to the output
interface circuitry, which will control the appliances. This will occur in real time,
optimized for minimum delay.
Here is a flow chart of this process.
[Analog Input]
↓
[A/D Conversion]
↓
[Software Processing]
↓
[Database Comparison via Speech Recognition Algorithm]
↓
[On Match, Issue Control Signal over Serial Output Interface]
↓
[Serial Connection]
↓
[Hardware Interface Controls Hardware]
Figure 1 - General Project Layout
Software Concept:
A large amount of the work is done on the signal during the software processing
part of the system. Through our research of voice recognition software, we have
distinguished three distinct parts needed to create a software package that will
effectively process and recognize our spoken commands. These parts are:
Feature extraction: Analyses both pre-recorded database samples and live
signals from the microphone input and distils them down into a matrix that
describes the sample with cepstrum, log energy, delta cepstrum, and delta-delta
cepstrum numerical coefficients.
Database creation: Holds the acquired matrices from the feature extraction in
memory. Generally, a frame of 20-25 ms extracted from the recorded command
can have up to 40 numeric values called parameters. Considering a recording
length of about 1 second, each command can have up to 2000 parameters. For a
set of five commands as is to be implemented in this project, the database could
have up to 10000 numeric values.
Our demo system database has 5 stored commands recorded 5 times for 5
different users (group members). The 5 commands were also recorded an
additional 2 times by 2 female users.
Pattern Recognition Algorithm: The dynamic time warping algorithm compares
the features extracted from the live input signal against the features stored in the
database, and returns a match for the database entry that most closely
resembles the input signal.
A match will then trigger the output interface to communicate an ASCII character
describing the matched command over the output interface and into the
hardware interface.
Output interface and Hardware interface Concept:
Figure 2 - Hardware Layout
The output interface will be a serial connection, comprising code in MATLAB
configured to communicate over the serial port of Kunal's desktop PC to the
demonstration circuit. This will connect to the hardware interface on the
demonstration circuit, which will receive the serial signal and translate the data
into a control signal via a programmed microcontroller.
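As an illustration, the MATLAB side of this output interface might look like the minimal sketch below. The port name 'COM1', the baud rate, and the variable matchIndex are assumptions, not the project's actual code; the ASCII characters correspond to Table 1 in the Hardware section.

    % Minimal sketch of the output interface (assumed settings: COM1, 9600 baud).
    % Maps a matched command index to the ASCII character expected by the
    % microcontroller and writes it over the serial port.
    cmdChars = ['o' 'f' 'c' 'w' 'h'];       % On, Dark, Fifteen, Twenty, Twentyfive
    s = serial('COM1', 'BaudRate', 9600);   % open the serial connection
    fopen(s);
    fprintf(s, '%c', cmdChars(matchIndex)); % matchIndex = 1..5 from the recognizer
    fclose(s);
    delete(s);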
To translate the serial signals into control signals we plan to develop a small
hardware demonstration circuit consisting of the following components:
• A serial port connection to receive signals from the PC
• A preprogrammed microcontroller to route signals to either the light or the thermostat
• A relay to translate a 5 VDC control circuit signal to 120 VAC for our light
• Two LED displays with appropriate control chips to display thermostat settings
This device will be packaged and wired appropriately to interface to a 120 V desk lamp and the LED displays for the thermostat.
Upon coordinating with the 499 Project technicians in the engineering
department, we settled on the following setup to achieve the above design.
• Serial output from the PC (cable and connectors)
• MAX 232 serial line driver chip
• Atmel ATMEGA8 8-bit flash-programmable microcontroller
• Texas Instruments SN74LS47 BCD to seven-segment driver chips
• Common-anode 7-segment LEDs
• 5 VDC to 120 V relay (wired box to house the relay and circuit connections for 120 V outlets, and wall connection to 120 V)
The rest of the demonstration circuit will be constructed with basic off-the-shelf
lab components, including resistors, transistors, relays, and wiring.
The thermostat demonstration circuit will be implemented with control signals
from the microcontroller feeding the BCD driver chips, which drive the 7-segment
LEDs to the right display values.
Figure 3 - Hardware Schematic
2.0 Background
Speech recognition:
Speech recognition is an advanced form of decision making whereby the
input originates with the spoken word of a human user. Ideally, this is the only
input that is required.
There are many ways in which speech recognition can be implemented.
For the purposes of this report, it is assumed that a microphone connected to a
computer is available. On the computer, a program called MATLAB is used to
implement the algorithm and store the database.
Requirements for speech recognition
1. A database
This database serves as the main point of comparison. When an
input is directed to the algorithm, the algorithm compares it to what
is contained in the database (discussed below) using an algorithm
(discussed below) that maximises accuracy while minimising
computing time.
2. Input
Input in this case comes in the form of a microphone connected
through a computer to MATLAB.
3. Algorithm for comparison
As the main computing point, the algorithm dictates the speed and
accuracy of the whole system. For example, a point-by-point
comparison of the input to the database would be costly in terms of
time and highly inaccurate. Much work has been done to find an
algorithm which provides the benefits required to make a practical
speech recognition system. Algorithms are listed below.
Information contained in Database
The database contains the features of the pre-recorded commands.
Features include:
1. MFCCs
The procedure for extracting MFCCs is:
a. Take the Fourier transform of the signal (done for each window)
b. Map the log amplitudes of the spectrum onto the Mel scale, using
triangular overlapping windows.
Mel scale: to convert f hertz into m mel:
m = 1127.01048 * ln(1 + f/700)
and to convert m mel into f hertz:
f = 700 * (exp(m/1127.01048) - 1)
c. Take the Discrete Cosine Transform of the list of Mel log amplitudes
d. The amplitudes of the resulting spectrum are the MFCCs. [ref:
http://en.wikipedia.org/wiki/Mel_scale]
2. Delta
a. Taken by differentiating the MFCC coefficients to find the first-order
rate of change
3. Delta-delta
a. Found by differentiating the delta coefficients to find the second-order
rate of change in the MFCCs
4. Energy
a. The log energy of the signal is computed using overlapping
triangular windows.
The figure below illustrates the feature extraction in a flow chart.
Figure 5 - Feature extraction.
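As an illustration of the delta and delta-delta items above, a minimal MATLAB sketch is shown below. It assumes an mfcc matrix of size (coefficients x frames) and uses a simple first difference; practical systems often use a regression over several neighbouring frames instead.

    % Minimal sketch: delta and delta-delta features from an MFCC matrix.
    % mfcc is assumed to be a (numCoeffs x numFrames) matrix; a first
    % difference along the frame axis approximates the rate of change,
    % with zero-padding so the output keeps the same number of frames.
    delta  = [zeros(size(mfcc,1),1), diff(mfcc, 1, 2)];   % first-order change
    ddelta = [zeros(size(delta,1),1), diff(delta, 1, 2)]; % second-order change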
Types of Algorithms
1. Dynamic Time Warping (DTW)
Dynamic time warping is a method that is most applicable to signals
which are skewed or shifted in time relative to each other. For
example, if one signal is compared to another signal that is
identical but shifted along the x (time) axis, a point-to-point
Euclidean comparison will give a large error. However, if the shift
is accounted for, as it is in DTW, the two signals will be recognised
as being very similar, which they are. In this way, DTW is ideal for
speech recognition, where one word spoken by two users is never
exactly the same, but is often said with differing speed or emphasis.
In the figure below, the input signal and template signal are
compared. If the two signals were exactly the same, the minimum
distance path would be a 45 degree line between the two.
However, any skew will cause the minimum distance mapping to
shift. DTW takes advantage of this fact and gives a distance which
accounts for this shift.
Figure 6 - Warping path
[Ref: http://www.cse.unsw.edu.au/~waleed/phd/html/node38.html]
Two signals warped in time:
Figure 7 - Two skewed signals
As shown above, the two signals are, in fact, the same. However, a
Euclidean comparison would give a large difference. DTW accounts
for the skew in the difference it computes, much like B above.
DTW compensates for the skew by computing the minimum distance between
the two signals. Here, a minimum distance warping path is shown:
Figure 8 - Minimum distance warping path
[Ref for the two figures above: http://www.cs.ucr.edu/~eamonn/sdm01.pdf]
2. Hidden Markov Model (HMM)
The HMM algorithm is a statistical model in which the process is
assumed to be a Markov process with hidden (unknown)
parameters. The hidden parameters are deduced by analysing the
known parameters. By computing these states, pattern recognition
is possible, which is how HMM can be used for speech recognition.
HMM is a complex algorithm which provides the most benefit for
large-vocabulary systems; in this project, only five commands
needed to be recognised.
3. Neural Networks (NN)
Neural networks use a network of "neurons" of acoustic phonemes
which are compared to the input to find a match. NN is a highly
mathematical approach which is useful for recognising longer words
containing many phonemes. In this project, words were kept
short.
In light of the above information, dynamic time warping was judged to be the best
choice for this project.
Other concepts used in this report
Windowing
Windowing a signal ensures that there are no sharp cut-offs at the
beginning or end of the signal, which can cause unwanted high frequencies to
appear in the signal. In this project a Hamming window is used, with the formula:
w(n) = 0.53836 - 0.46164 * cos(2*pi*n/N)
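As an illustration, framing and windowing a recorded signal might look as follows in MATLAB. This is a minimal sketch, assuming a signal vector x sampled at fs Hz; the 25 ms frame length and 10 ms step are assumptions consistent with the 20-25 ms frames and 10 ms spacing mentioned elsewhere in this report.

    % Minimal sketch: split signal x into overlapping Hamming-windowed frames.
    frameLen  = round(0.025 * fs);          % assumed 25 ms per frame
    step      = round(0.010 * fs);          % assumed 10 ms frame advance
    n         = (0:frameLen-1)';
    w         = 0.53836 - 0.46164 * cos(2*pi*n/frameLen);  % Hamming window
    numFrames = floor((length(x) - frameLen) / step) + 1;
    frames    = zeros(frameLen, numFrames);
    for k = 1:numFrames
        seg = x((k-1)*step + (1:frameLen)); % extract one frame of the signal
        frames(:, k) = seg(:) .* w;         % windowed frame as one column
    end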
Modelling of speech
To create the database and to use analog speech in the digital domain, it is
necessary to model it accurately and succinctly. A simple analog-to-digital
conversion is too large and not descriptive enough to be used for comparison with
high accuracy. Speech and speech production are modelled as follows:
Speech consists of voiced and unvoiced sounds. Voiced sounds such as 'a' and
'b' are due to vibrations of the vocal cords and can be accurately modelled as
a sum of sinusoids. Unvoiced sounds are, when looked at in a short time
frame (say 10 ms), noise, and can be modelled as such.
Voiced sounds are made by allowing air to pass freely over the vocal cords and out
of the vocal tract.
Unvoiced sounds are made by a constricted vocal tract, producing turbulence.
The vocal tract changes in time to produce voiced and unvoiced sounds in
succession. This is speech.
Speech can be modelled as a time varying signal. However, since the vocal tract
varies slowly with respect to the pitch of speech (typical pitch is on the order of
100-300 Hz, a period of a few milliseconds), speech can be modelled over short
time frames as an LTI (linear time invariant) system, where the transfer function
of the vocal tract, v(t), is convolved with a driving impulse train, x(t), producing
the sound, s(t).
Figure 9 - Magnitude response of the vocal tract
Figure 10 – The vocal tract as an LTI filter.
[Ref: http://cobweb.ecn.purdue.edu/~ipollak/ee438/FALL04/notes/Section2.2.pdf]
To model the vocal tract as a digital filter, the poles of its transfer function can be
computed.
Figure 11 - Poles near the unit circle correspond to large values of H(e^jw).
The location of the poles depends on the resonant frequencies which occur in the
vocal tract. These are called formant frequencies.
3.0 Database
The database is constructed from the pre-recorded commands uttered by all the
team members. In addition to the team members, a few more people were also
recorded to improve recognition and speaker independence.
Database Design
Each entry in the database is designed to correspond to a single command
utterance. The entries were designed to be feature matrices containing the
features extracted from the pre-recorded samples, with one entry for each
pre-recorded command.
Database construction
There are, in all, 155 commands in the database. Five utterances per command,
for five commands and five team members, gives 125 commands; the rest of the
commands were recorded by external people.
The entries in the database correspond to the commands and take the form of
feature matrices extracted from the vocal commands. The formation of the
feature matrices is summarized in the next section, 'Feature Extraction'.
Feature Extraction
The overview of the feature extraction is presented in the diagram below. The
input signals are digitized first (if not pre-recorded commands) and split into
short-time frames. The 'Cepstral coefficient extraction' returns the MFCC
coefficients and the frame energy. These coefficients and the energy are further
processed to derive the delta-cepstral and delta-delta cepstral coefficients.
Figure 12 - Feature Extraction process
The cepstral coefficient extraction is illustrated with plots in the following section.
Cepstral Coefficient Extraction
The block diagram of the cepstral coefficient extraction block is shown below:
Figure 13 - Cepstral coefficient calculation from frames
As shown above, the Fast Fourier Transform is applied to each of the frames of
the digitized command.
The next step is to calculate the frame energy. The Fourier transform returns
complex values; in order to make use of those values, they must first be
converted to real values. Taking the absolute value of each complex number in
the array returns its real-valued magnitude, and the magnitudes are squared to
calculate the energy. The magnitudes of the Fast Fourier Transform are plotted
in the figure below:
Figure 14 - FFT Magnitude spectrum for a sample frame
These magnitude squares are summed to form the frame energy, one of the
parameters. The magnitude squares are also passed downstream to the Mel
filter bank for further processing. The Mel filter bank is a set of filters designed
based on the Mel frequency scale, which is designed to represent the way
humans perceive sound. The frequency mapping of the Mel scale is shown with
respect to linear frequency below:
Figure 15 - MEL frequency vs Linear frequency
The Mel Filter bank is pictured in the figure below.
Figure 16 - Mel Filter bank magnitudes
The Discrete Cosine Transform (DCT) is applied to the signal after it has been
filtered by the Mel filter bank and its log value taken. The result of the DCT is the
set of Mel-Frequency Cepstral Coefficients. The cepstral coefficients for a sample
frame are shown below:
Figure 17 - Mel Frequency Cepstral Coefficients for a sample frame
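To tie the above steps together, the per-frame computation might look like the following minimal sketch in MATLAB. The FFT length nfft, the filter bank matrix melBank, and the convention of keeping 12 coefficients while discarding the zeroth are assumptions for illustration, not the project's exact code.

    % Minimal sketch of the per-frame cepstral coefficient extraction.
    % frame: one windowed speech frame; melBank: assumed (numFilters x numBins)
    % matrix of triangular Mel filters. dct is from the Signal Processing Toolbox.
    X      = fft(frame, nfft);              % FFT of the frame
    magSq  = abs(X(1:nfft/2+1)).^2;         % squared magnitude spectrum
    frameE = log(sum(magSq));               % log frame energy parameter
    melE   = melBank * magSq(:);            % energy collected in each Mel band
    c      = dct(log(melE));                % DCT of the log Mel energies
    mfccs  = c(2:13);                       % keep 12 MFCCs (assumed convention)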
Feature matrix and database formation
In the project, there were in total 39 features per frame:
1. 12 MFCC coefficients
2. 1 Frame energy
3. 13 Delta-cepstral
4. 13 Delta-delta cepstral
The feature matrix is a 2D matrix containing 39 rows for the number of features
and 'x' columns for the 'x' frames in the command. This feature matrix is then
inserted into the database at an index that is mapped to the command input it
was extracted from. Thus, the database is a 3D matrix where each 2D matrix
corresponds to a command. The block diagram below describes the structure of
the database.
Figure 17 - Database structure
Repeating the feature extraction and storing process for all the command
utterances, the final size of the database was 39 x 198 x 155, where:
• 39 represents the number of rows (features)
• 198 represents the (maximum) number of frames
• 155 represents the number of commands
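A minimal sketch of how such a 3D database might be assembled in MATLAB follows; the helper extractFeatures and the cell array recordings are hypothetical names. The automatic zero-padding this relies on is exactly the behaviour discussed in the next section.

    % Minimal sketch of database formation: each 39 x n feature matrix is
    % written into one 'page' of a 3D array. MATLAB expands the array on
    % indexed assignment and zero-fills any unused columns.
    db = [];                                  % grows to 39 x maxFrames x count
    for k = 1:numel(recordings)
        F = extractFeatures(recordings{k});   % hypothetical helper, 39 x n
        db(1:size(F,1), 1:size(F,2), k) = F;  % pad/expand as needed
    end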
Database comparison
For any command input to the program, the features are extracted in the same
way as for the pre-recorded commands and stored in a feature matrix on a
frame-by-frame basis. The resulting feature matrix is a 2D matrix of features.
To detect the command that was uttered, the feature matrix constructed from the
input must be compared with the feature matrices inside the database. The
algorithm used for the comparison of the feature matrices was Dynamic Time
Warping (DTW). DTW has been described in detail in the background section.
The flow-chart below describes the algorithm for the comparison part:
Figure 18 - Database comparison algorithm flow chart
As shown in the diagram above, feature matrices are retrieved from the database
individually and compared with the input feature matrix. An important point when
retrieving matrices from the database is to consider the zero frames. MATLAB
assigns matrix dimensions dynamically, expanding an array to accommodate as
many columns as required. The database matrix therefore has a single size, set
by its largest feature matrix. So what happens to the feature matrices whose
dimensions are smaller than those of the database matrix, and what values
appear in their empty cells? The answer is zeros: MATLAB pads the empty cells
of the smaller feature matrices with zeros.
The padding of zeros causes several problems:
• It corrupts the feature matrix with false values
• It adds unnecessary computations
• It causes the comparison algorithm to yield incorrect matches
To get good recognition, the zeros need to be removed when comparing the
stored feature matrices with the input feature matrix. The 'Strip Zeros' block in
the algorithm does just that: it removes the zeros from the feature matrices
before passing them on to the comparison algorithm, so that the comparison
yields reliable and accurate results. This was one of the main challenges we
came across in the project: accuracy was very poor until the padding was
corrected, after which recognition worked very well.
Once the comparison is completed using DTW, the program returns a cumulative
distance value, which represents the difference between the two matrices. These
cumulative distances against the input features are collected for all the feature
matrices in the database. Once all the distances have been collected, the
minimum distance among all comparisons is found, and the corresponding
feature matrix is taken as the match. The code then maps the index of the
feature matrix to the command index, and hence the command is identified.
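A minimal sketch of this comparison loop, including the 'Strip Zeros' step, is shown below. Here db is the 39 x 198 x 155 database, Fin is the input feature matrix, dtwDistance is a hypothetical function implementing the DTW of Section 4.0 (a sketch of it appears at the end of that section), and commandOf is a hypothetical lookup from database entry to command.

    % Minimal sketch of the database comparison described above.
    numEntries = size(db, 3);
    dist = inf(1, numEntries);
    for k = 1:numEntries
        Fk = db(:, :, k);
        Fk = Fk(:, any(Fk ~= 0, 1));       % 'Strip Zeros': drop padded columns
        dist(k) = dtwDistance(Fin, Fk);    % cumulative DTW distance
    end
    [minDist, match] = min(dist);          % index of the best-matching entry
    command = commandOf(match);            % hypothetical map: entry -> command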
The serial port is driven with the output associated with the command, which then
controls the hardware. More details about the hardware interface can be found in
the 'Hardware' section (Section 6.0).
4.0 Dynamic Time Warping (DTW)
In our project, the speech signal is represented by a series of feature vectors which
are computed every 10 ms. A whole word will comprise dozens of those vectors,
and the number of vectors (the duration) of a word depends on how fast a person
is speaking. In speech recognition, we have to classify sequences of vectors.
Therefore, we need a way to compute a distance between an unknown sequence
of vectors X and a known sequence of vectors W, which is a prototype for a word
we want to recognize.
Distance between Two Sequences of Vectors
A classification of a spoken utterance would be easy if we had a good distance
measure D(X, W) at hand. To be useful, the distance measure must:
• Measure the distance between two sequences of vectors of different length
• While computing the distance, find an optimal assignment between the individual feature vectors
• Compute a total distance as the sum of the distances between the individual pairs of feature vectors
Comparing the distance between two sequences of vectors of
different length
In the dynamic time warping (DTW) method, when comparing sequences of
different length, the sequence length is modified by repeating or omitting some
frames so that both sequences have the same length, as shown in Figure 19
below. This modification of the sequences is called time warping.
Figure 19 – Linear Time Warping
[Ref: http://www.tik.ee.ethz.ch/~gerberm/PPS_Spracherkennung/SS06/lecture4.pdf]
As can be seen from Figure 19, the two sequences X and W consist of six and
eight vectors, respectively. The sequence W was rotated by 90 degrees, so that
the time index for this sequence runs from the bottom of the sequence to its top.
The two sequences span a grid of possible assignments between the vectors.
Each path through this grid (such as the path shown in the figure) represents one
possible assignment of the vector pairs. For example, the first vector of X is
assigned to the first vector of W, the second vector of X is assigned to the second
vector of W, and so on. As an example, let us assume that a path P is given by
the following sequence of time index pairs of the vector sequences:
P = {(0, 0), (1, 1), (2, 2), (3, 2), (4, 2), (5, 3), (6, 4), (7, 4)}
The length of path P is determined by the maximum of the number of vectors
contained in X and W. The assignment between the time indices of W and X as
given by P can be interpreted as "time warping" between the time axes of W and
X. In our example, the vectors x2, x3 and x4 were all assigned to w2, thus
warping the duration of w2 so that it lasts three time indices instead of one. By
this kind of time warping, the different lengths of the vector sequences can be
compensated. For the given path P, the distance measure between the vector
sequences can then be computed as the sum of the distances between the
individual vectors.
Finding the Optimal Path
Once we have the path, computing the distance becomes a simple task. The
DTW distance can be computed efficiently by using Bellman's principle of
optimality. It states that if the optimal path through the matrix of grid points
begins at A, ends at B, and passes through a grid point K, then the partial path
from A to K is itself the optimal path from A to K. From this, we can construct a
way of iteratively finding our optimal path P, as shown in Figure 20.
Figure 20 - Nonlinear path options
[Ref: http://www.tik.ee.ethz.ch/~gerberm/PPS_Spracherkennung/SS06/lecture5.pdf]
According to this principle, it is not necessary to compute all possible paths P
and corresponding distances to find the optimum path. Out of the huge number
of theoretically possible paths, only a fraction is computed. To illustrate this
concept further, we need to discuss what is called a local path alternative or local
distance.
Local distances
Since both sequences of vectors represent feature vectors measured in short
time intervals, we can restrict the time warping to reasonable boundaries. The
first vectors of X and W should be assigned to each other, as should their last
vectors. For the time indices in between, we want to avoid any big leap backward
or forward in time, and instead restrict the time warping to the reuse of the
preceding vector(s) to locally warp the duration of a short segment of the speech
signal. With these restrictions, we can draw a diagram of the possible local path
alternatives for one grid point and its possible predecessors.
As we can see, a grid point (i, j) can have the following predecessors:
• (i − 1, j): horizontal local path
• (i − 1, j − 1): diagonal local path
• (i, j − 1): vertical local path
Figure 21 – Accumulated Distance in point (i, j)
[Ref: http://www.tik.ee.ethz.ch/~gerberm/PPS_Spracherkennung/SS06/lecture5.pdf]
All possible paths P which we will consider as candidates for the optimal path
can be constructed as a concatenation of the local path alternatives described
above. According to the local path alternatives diagram shown above, there are
only three possible predecessor paths leading to a grid point (i, j): the partial
paths from (0, 0) to the grid points (i − 1, j), (i − 1, j − 1), and (i, j − 1). The
(globally) optimal path from (0, 0) to grid point (i, j) can be found by selecting
exactly the one path hypothesis among these alternatives which minimizes the
accumulated distance A(i, j) of the resulting path from (0, 0) to (i, j). That is,
A(i, j) = d(i, j) + min{ A(i − 1, j), A(i − 1, j − 1), A(i, j − 1) }
where d(i, j) is the local distance between the vectors xi and wj.
Starting from grid point (0, 0) and the vector distances defined by the grid points
(1, 0) and (0, 1), we can compute A(1, 0) and A(0, 1). We then look at the points
which can be computed from the three points just finished; for each of these
points (i, j), we search for the optimal predecessor point out of the set of possible
predecessors. In that way we walk through the matrix from bottom-left to
top-right. Once we reach the top-right corner of the matrix, the accumulated
distance A(RefFrames, TestFrames) is the distance D(W, X) between the vector
sequences.
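The procedure above can be condensed into a short MATLAB function. The following is a hypothetical sketch of the recurrence, assuming the Euclidean norm as the local distance between feature vectors; it is not the project's exact code.

    function D = dtwDistance(X, W)
    % DTW distance between two sequences of feature vectors, stored as the
    % columns of X (features x n) and W (features x m).
    n = size(X, 2);
    m = size(W, 2);
    A = inf(n, m);                         % accumulated distances
    A(1, 1) = norm(X(:,1) - W(:,1));       % first vectors assigned to each other
    for i = 1:n
        for j = 1:m
            if i == 1 && j == 1, continue; end
            d = norm(X(:,i) - W(:,j));     % local distance d(i, j), Euclidean
            best = inf;
            if i > 1,          best = min(best, A(i-1, j));   end  % horizontal
            if i > 1 && j > 1, best = min(best, A(i-1, j-1)); end  % diagonal
            if j > 1,          best = min(best, A(i, j-1));   end  % vertical
            A(i, j) = d + best;            % Bellman recurrence
        end
    end
    D = A(n, m);                           % distance D(W, X) between sequences
    end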
5.0 Experiments & Results
In addition to the results noted on the presentation day, this section will also
present results for the recognition rates for the team members and five external
speakers with varied accents.
On the presentation day, the project had more success with the 'On' and 'Dark'
commands; the other commands, 'Fifteen', 'Twenty', and 'Twentyfive', suffered
in accuracy for external speakers. One reason for this large difference in
recognition could be the length of the commands: 'On' and 'Dark' are short, and
hence different accents and speech variations affect them less, whereas for the
longer commands, pitch and accent come into the picture and accuracy suffers.
On the presentation day:
• We had close to 100% accuracy for the speakers that the database was trained on
• We had about 80% accuracy for all speakers for the command 'On'
• We had about the same accuracy (80%) for the command 'Dark' for all speakers
• Many people did not try the other commands, but for those that did:
  o 'Fifteen' had about 60% accuracy
  o 'Twenty' had about the same (60%) hit rate
  o 'Twentyfive' was the worst, with a meagre hit rate of about 40%
For the report, it is planned to test the program in the following way:
• 10 repeats of each command by all the team members and, in addition, five external speakers
  o The hit rate will be treated as the accuracy
• With five commands, that is 50 repeats per speaker; for 10 speakers, the total number of test samples will be 500
• Results will be classified by command, by speaker type (external or database), and overall (including all speakers)
6.0 Hardware
Figure 22 - Hardware schematic
Figure 23 - 5V DC to 120V AC wiring box schematic
Command name        | Fifteen | Twenty | Twentyfive | On       | Dark
Device status       | 15      | 20     | 25         | Light on | Light off
ASCII character     | c       | w      | h          | o        | f
Relay input: PB0    | -       | -      | -          | 1        | 0
BCD chip 1 input:
  PC0               | 1       | 0      | 0          | -        | -
  PC1               | 0       | 1      | 1          | -        | -
  PC2               | 0       | 0      | 0          | -        | -
BCD chip 2 input:
  PC3               | 1       | 0      | 1          | -        | -
  PC4               | 0       | 0      | 0          | -        | -
  PC5               | 1       | 0      | 1          | -        | -
Table 1 – Microcontroller signal routing table. The temperature set-points (15, 20, 25) are shown on the two 7-segment LED displays.
The microcontroller (Atmel) chip is programmed according to the table above. Bit
strings are sent from the DSP chip (or, during the demonstration, from the
desktop computer through the serial port) as the ASCII characters shown in the
table. When a specific character is recognised, for instance "o", pin PB0 sends a
true signal to turn on the relay, and thus the lamp is turned on; when "f" is
detected, the same pin sends a false signal to turn off the lamp. Six pins (PC0 to
PC5) are routed to the two BCD drivers. When the characters "c", "w", or "h" are
detected, the binary bits operate as shown in the table above to display the
temperatures "15", "20", or "25" on the two 7-segment LEDs.
7.0 Conclusion
The speech recognition was implemented in MATLAB using several extracted
features: cepstrum, log energy, delta cepstrum, and delta-delta cepstrum
numerical coefficients. Research was done on a few algorithms: Neural
Networks (NN), the Hidden Markov Model (HMM), and Dynamic Time Warping
(DTW). The conclusion was that our speech recognition should be performed
using Dynamic Time Warping. The DTW algorithm compares the features
extracted from the live input signal against the features stored in the database; if
a database match is found, the output interface is triggered to produce a serial
bit string as output. Matched commands are described as ASCII characters, and
the bit strings are sent to the hardware interface, where the Atmel chip routes the
signal accordingly.
Five commands (On, Dark, Fifteen, Twenty, and Twentyfive) were recorded in
different samples (high pitch, low pitch, slow, fast, and normal) by various
speakers, male and female, with different accents. Results on the presentation
day were as listed below:
• 100% accuracy for the speakers that the database was trained on
• 80% accuracy for the command 'On'
• 80% accuracy for the command 'Dark'
• 60% accuracy for the command 'Fifteen'
• 60% accuracy for the command 'Twenty'
• 40% accuracy for the command 'TwentyFive'
However, these results are not final; another recording session will be done soon
to gather a larger database, with more speakers recording their commands to
increase the accuracy. For the final report, it is planned to test the program in the
following way:
• 10 repeats of each command by all the team members and, in addition, five external speakers
  o The hit rate will be treated as the accuracy
• With five commands, that is 50 repeats per speaker; for 10 speakers, the total number of test samples will be 500
8.0 References
[1] Mohammed Waleed Kadous, "Dynamic Time Warping", http://www.cse.unsw.edu.au/~waleed/phd/html/node38.html (2002-12-10)
[2] Eamonn J. Keogh and Michael J. Pazzani, "Derivative Dynamic Time Warping", http://www.cs.ucr.edu/~eamonn/sdm01.pdf (2003-10-16)
[3] Ilya Pollak, "Speech Processing", http://cobweb.ecn.purdue.edu/~ipollak/ee438/FALL04/notes/Section2.2.pdf (2004-09-05)
[4] Michael Gerber, "PPS on Speech Recognition", http://www.tik.ee.ethz.ch/~gerberm/PPS_Spracherkennung/SS06 (2006-01-05)
[5] Joseph W. Picone, "Signal Modeling Techniques in Speech Recognition", Proceedings of the IEEE, Vol. 81 (1993-09-09)
[6] "Window Function", http://en.wikipedia.org/wiki/Hamming_window (2006-12-12)
[7] Minh N. Do, "An Automatic Speaker Recognition System"
[8] B. Plannerer, "An Introduction to Speech Recognition", http://www.speech-recognition.de/textbook.html (2003-05-18)
[9] Longbiao Wang, Norihide Kitaoka, Seiichi Nakagawa, "Robust Distant Speaker Recognition Based on Position Dependent Cepstral Mean Normalization", 2005
[10] "Dynamic Time Warping", http://en.wikipedia.org/wiki/Dynamic_time_warping (2006-12-16)
[11] Stuart N. Wrigley, "Speech Recognition by Dynamic Time Warping", http://www.dcs.shef.ac.uk/~stu/com326/sym.html
[12] "Formant", http://en.wikipedia.org/wiki/Formant (2006-12-16)
[13] "Signal Energy vs. Signal Power", http://cnx.org/content/m10055/latest/ (2004-08-12)
[14] Lawrence R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, 77 (2), pp. 257-286 (1989-02-09)