University of Victoria
Faculty of Engineering
Spring 2007

Elec 499B Final Report
Speech Activated Appliances

Group Members:
David Beckstrom
Harun Ali
Kunal Jadav
Nicole Su Lee, Ng
Zorawar Bhatia

In partial fulfillment of the requirements of the B.Eng. degree

Table of Contents

1.0 Introduction
    Concept
    Goal of Project
    Realization
    Demonstration System Operation
    Software Concept
        Feature extraction
        Database creation
        Pattern Recognition Algorithm
    Output Interface and Hardware Interface Concept
2.0 Background
    Speech recognition
    Requirements for speech recognition
    Information contained in Database
        Types of Algorithms
    Other concepts used in this report
        Windowing
        Modelling of speech
3.0 Database
    Database Design
    Database construction
        Feature Extraction
        Feature matrix and database formation
    Database comparison
4.0 Dynamic Time Warping (DTW)
    Distance between Two Sequences of Vectors
    Comparing the distance between two sequences of vectors of different length
    Finding the Optimal Path
        Local distances
5.0 Experiments & Results
6.0 Hardware
7.0 Conclusion
8.0 References

1.0 Introduction

Concept:

This project demonstrates the next generation of home automation technology: speech activated appliances. Imagine being able to control the lights in a room, or the temperature of your home, with a simple spoken command. Imagine the security of having your home recognize and respond to your voice alone. This project explores these ideas by developing voice recognition software, and then demonstrating that software through a basic implementation of a voice recognition system built on readily available electronic components and hardware. The system responds to a list of defined spoken commands and controls two basic household appliances: a lamp, and an LED display mimicking a thermostat.

Goal of Project:

A full realization of this concept would involve a few distinct steps. First, develop a database of commands for the system to respond to. Second, develop voice recognition software that can compare a command issued to the system against the database of commands. Third, develop sufficient hardware to translate a matched command into a control signal, and finally into a realized change of state in hardware. Fourth, port the above system to a programmable DSP chip so that it operates independently of an external computer and interacts with its hardware inputs and outputs on its own. Such a system would be integrated into the user's home, use microphones installed in the home as input sources, and issue control signals to hardware already installed in the home.

Realization:

As a full realization of this concept is beyond the time and budgetary constraints of this project, we instead plan to prove the concept by designing a demonstration circuit that operates as a scaled-down version of the above system. The aim of this project, then, is to prove the concept of speech activated appliances by developing a voice recognition system that recognizes five user-spoken commands in real time and issues control signals to a pair of recognizable household appliances.

- A standard PC microphone will be used as the input source.
- The voice recognition software will be written in MATLAB, to be run on a desktop PC.
- A simple hardware interface will be developed to translate control signals into a change of state in the appliances.
- The appliances we have chosen to demonstrate the control aspect of the project are a light and a thermostat:
  - The thermostat will be simulated with two seven-segment LED displays showing the current temperature set-point of the demonstration thermostat.
  - The lamp will be realized with a standard 120 V desk lamp.

Demonstration System Operation:

A fully functional system will operate as follows. An individual approaches the microphone and issues one of five pre-recorded commands: "On" or "Dark" to control the lamp, or "Fifteen", "Twenty", or "Twentyfive" to change the set-point of the thermostat. The analog signal of this command is converted to a digital signal. After A/D conversion, software processes the signal and stores it in memory. The stored information is then compared against the information in a database of pre-recorded commands via a speech recognition algorithm. When a match is made, a control signal is issued to the output interface circuitry, which controls the appliances. This occurs in real time, optimized for minimum delay. The process is summarized in the flow chart below:

[Analog Input] -> [A/D Conversion] -> [Software Processing] -> [Database Comparison via Speech Recognition Algorithm] -> [On match, issue control signal over serial output interface] -> [Serial Connection] -> [Hardware Interface Controls Hardware]

Figure 1 - General project layout

Software Concept:

A large amount of the work on the signal is done during the software processing part of the system. Through our research into voice recognition software, we distinguished three distinct parts needed to create a software package that will effectively process and recognize our spoken commands. These parts are:

Feature extraction: Analyses both pre-recorded database samples and live signals from the microphone input, and distils each down into a matrix that describes that sample with cepstrum, log energy, delta cepstrum, and delta-delta cepstrum numerical coefficients.

Database creation: Holds the matrices acquired from feature extraction in memory. Generally, a frame of 20-25 ms extracted from the recorded command can have up to 40 numeric values, called parameters. For a recording length of about 1 second, each command can therefore have up to 2000 parameters, and for the set of five commands implemented in this project the database could hold up to 10000 numeric values. Our demo system database has 5 commands, each recorded 5 times by each of the 5 group members; the 5 commands were also recorded an additional 2 times each by 2 female users.

Pattern Recognition Algorithm: The dynamic time warping algorithm compares the features extracted from the live input signal against the features stored in the database, and returns a match for the database entry that most closely resembles the input signal. A match then triggers the output interface to send an ASCII character describing the matched command over the output interface to the hardware interface.

Output Interface and Hardware Interface Concept:

Figure 2 - Hardware layout

The output interface is a serial connection consisting of MATLAB code configured to communicate over the serial port of Kunal's desktop PC to the demonstration circuit. This connects to the hardware interface on the demonstration circuit, where a programmed microcontroller receives the serial signal and translates the data into a control signal.
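As an illustration of this output interface, a minimal MATLAB sketch is shown below; the port name and baud rate are assumptions for illustration, not values taken from the project, and the character 'o' is the lamp-on code from Table 1 in Section 6.

    % Minimal sketch of the output interface: send one ASCII command
    % character to the demonstration circuit over the PC serial port.
    % The port name and baud rate here are assumed for illustration.
    s = serial('COM1', 'BaudRate', 9600);  % create the serial port object
    fopen(s);                              % open the connection
    fprintf(s, '%c', 'o');                 % e.g. 'o' = lamp on (see Table 1)
    fclose(s);                             % close the connection
    delete(s);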
To translate the serial signals into control signals, we plan to develop a small hardware demonstration circuit consisting of the following components:

- A serial port connection to receive signals from the PC
- A preprogrammed microcontroller to route signals to either the light or the thermostat
- A relay to translate a 5 V DC control-circuit signal to 120 V AC for our light
- Two LED displays with appropriate driver chips to display thermostat settings

This device will be packaged and wired appropriately to interface with a 120 V desk lamp and the LED displays for the thermostat. After coordinating with the 499 project technicians in the engineering department, we settled on the following parts to achieve the above design:

- Serial output from PC (cable and connectors)
- MAX232 serial line driver chip
- Atmel ATmega8 8-bit flash-programmable microcontroller
- Texas Instruments SN74LS47 BCD-to-seven-segment driver chips
- Common-anode 7-segment LEDs
- 5 V DC to 120 V relay (wired box to house the relay and circuit connections for 120 V outlets and wall connection to 120 V)

The rest of the demonstration circuit will be constructed from basic off-the-shelf lab components: resistors, transistors, relays, and wiring. The thermostat demonstration circuit will be implemented with control signals from the microcontroller feeding the BCD driver chips, which drive the 7-segment LEDs to the correct display values.

Figure 3 - Hardware schematic

2.0 Background

Speech recognition:

Speech recognition is an advanced form of decision making whereby the input originates with the spoken word of a human user. Ideally, this is the only input that is required. There are many ways in which speech recognition can be implemented. For the purposes of this report, it is assumed that a microphone connected to a computer is available. On the computer, MATLAB is used to implement the algorithm and store the database.

Requirements for speech recognition:

1. A database. The database serves as the main point of comparison. When an input is directed to the algorithm, the algorithm compares it to what is contained in the database (discussed below) using an algorithm (also discussed below) that maximises accuracy while minimising computing time.

2. Input. Input in this case comes from a microphone connected through a computer to MATLAB.

3. Algorithm for comparison. As the main computing element, the algorithm dictates the speed and accuracy of the whole system. For example, a point-by-point comparison of the input to the database would be costly in terms of time and highly inaccurate. Much work has been done to find an algorithm which provides the benefits required to make a practical speech recognition system. Candidate algorithms are listed below.

Information contained in Database

The database contains the features of the pre-recorded commands. The features include:

1. MFCCs. The procedure for extracting MFCCs is:
   a. Take the Fourier transform of the signal (done for each window).
   b. Map the log amplitudes of the spectrum onto the Mel scale, using triangular overlapping windows. To convert f hertz into m mel: m = 1127.01048 ln(1 + f/700). To convert m mel into f hertz: f = 700 (exp(m/1127.01048) - 1). (A small conversion sketch is given after this list.)
   c. Take the discrete cosine transform of the list of Mel log-amplitudes.
   d. The amplitudes of the resulting spectrum are the MFCCs. [Ref: http://en.wikipedia.org/wiki/Mel_scale]

2. Delta coefficients, obtained by differentiating the MFCC coefficients to find their first-order rate of change.

3. Delta-delta coefficients, obtained by differentiating the delta coefficients to find the second-order rate of change of the MFCCs.

4. Energy. The log energy of the signal, computed using overlapping triangular windows.

The figure below illustrates feature extraction as a flow chart.

Figure 5 - Feature extraction
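The Mel-scale mapping in step (b) can be written directly from the two formulas above. A minimal MATLAB sketch, with the function names chosen here for illustration:

    % Hertz <-> mel conversion, taken straight from the formulas above.
    % The names hz2mel and mel2hz are chosen for illustration.
    hz2mel = @(f) 1127.01048 * log(1 + f / 700);
    mel2hz = @(m) 700 * (exp(m / 1127.01048) - 1);

    hz2mel(1000)           % 1000 Hz maps to roughly 1000 mel
    mel2hz(hz2mel(4000))   % the round trip returns 4000 Hz

The constant 1127.01048 is chosen so that 1000 Hz corresponds to 1000 mel; below about 1 kHz the scale is nearly linear, and above it the scale grows logarithmically, matching human pitch perception.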
Types of Algorithms

1. Dynamic Time Warping (DTW). Dynamic time warping is a method best suited to signals which are skewed or shifted in time relative to each other. For example, if a signal is compared to a copy of itself shifted along the x (time) axis, a point-to-point Euclidean comparison will give a large error. However, if the shift is accounted for, as it is in DTW, the two signals will be recognised as being very similar, which they are. In this way, DTW is ideal for speech recognition, where one word spoken by two users is never exactly the same, but is often said with differing speed or emphasis.

In the figure below, the input signal and template signal are compared. If the two signals were exactly the same, the minimum-distance path would be a 45-degree line between the two. Any skew, however, causes the minimum-distance mapping to shift. DTW takes advantage of this fact and gives a distance which accounts for the shift.

Figure 6 - Warping path [Ref: http://www.cse.unsw.edu.au/~waleed/phd/html/node38.html]

Two signals warped in time:

Figure 7 - Two skewed signals

As shown above, the two signals are, in fact, the same. However, a Euclidean comparison would give a large difference. DTW accounts for the skew by computing the minimum distance between the two signals. A minimum-distance warping path is shown here:

Figure 8 - Minimum-distance warping path [Ref for the above two figures: http://www.cs.ucr.edu/~eamonn/sdm01.pdf]

2. Hidden Markov Model (HMM). The HMM algorithm is a statistical model. The process is assumed to be a Markov process with hidden (unknown) parameters, which are deduced by analysing the known parameters. By computing these states, pattern recognition is possible, which is how HMMs can be used for speech recognition. HMM is a complex algorithm which provides the most benefit for large-vocabulary systems; in this project, only five commands need to be recognised.

3. Neural Networks (NN). Neural networks use a network of "neurons" of acoustic phonemes which are compared to the input to find a match. NN is a highly mathematical approach which is most useful for recognising longer words containing many phonemes; in this project, the words are short.

In light of the above, dynamic time warping was judged to be the best choice for this project.

Other concepts used in this report

Windowing

Windowing a signal ensures that there are no sharp cut-offs at the beginning or end of the signal, which would cause unwanted high frequencies to appear in its spectrum. In this project a Hamming window is used, with the formula (a short sketch follows):

w(n) = 0.53836 - 0.46164 cos(2*pi*n/(N-1)), for n = 0, ..., N-1
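As a quick illustration, the window can be generated and applied to one frame in a few lines of MATLAB. The 256-sample frame length and the random stand-in frame are assumptions for illustration only (MATLAB's built-in hamming() uses the rounder coefficients 0.54 and 0.46):

    N = 256;                                    % assumed frame length
    n = (0:N-1)';
    w = 0.53836 - 0.46164 * cos(2*pi*n/(N-1));  % Hamming window, formula above
    frame = randn(N, 1);                        % stand-in for one speech frame
    frame_windowed = frame .* w;                % taper the frame edges to near zero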
Modelling of speech

To create the database and to use analog speech in the digital domain, it is necessary to model speech accurately and succinctly. A raw analog-to-digital conversion is too large and not descriptive enough to be used for high-accuracy comparison. Speech and speech production are modelled as follows.

Speech consists of voiced and unvoiced sounds. Voiced sounds such as 'a' and 'b' are due to vibrations of the vocal cords and can be accurately modelled as a sum of sinusoids. Unvoiced sounds, when looked at over a short time frame (say 10 ms), are noise, and can be modelled as such. Voiced sounds are made by allowing air to pass freely over the vocal cords and out of the vocal tract; unvoiced sounds are made by a constricted vocal tract, producing turbulence. The vocal tract changes in time to produce voiced and unvoiced sounds in succession. This is speech.

Speech can therefore be modelled as a time-varying signal. However, since the shape of the vocal tract changes slowly (over tens of milliseconds) compared with the oscillations of the speech waveform itself, speech over short time frames can be modelled as a linear time-invariant (LTI) system, in which the impulse response of the vocal tract, v(t), is convolved with a driving impulse train, x(t), to produce the sound, s(t).

Figure 9 - Magnitude response of the vocal tract

Figure 10 - The vocal tract as an LTI filter [Ref: http://cobweb.ecn.purdue.edu/~ipollak/ee438/FALL04/notes/Section2.2.pdf]

To model the vocal tract as a digital filter, the poles of its transfer function can be computed.

Figure 11 - Poles near the unit circle correspond to large values of H(e^jw)

The locations of the poles depend on the resonant frequencies of the vocal tract. These are called formant frequencies.

3.0 Database

The database is constructed from pre-recorded commands uttered by all the team members. In addition to the team members, a few more people were recorded to improve recognition and speaker independence.

Database Design

Each entry in the database corresponds to a single command utterance. Each entry is a feature matrix containing the features extracted from the pre-recorded sample; there is one entry for every pre-recorded command.

Database construction

There are 155 commands in the database in all. Five utterances per command, for five commands and five team members, gives 125 commands; the remaining commands were recorded by external speakers. The entries in the database are the feature matrices extracted from the vocal commands. The formation of the feature matrices is summarized in the next section, 'Feature Extraction'.

Feature Extraction

An overview of feature extraction is presented in the diagram below. The input is digitized first (if it is not an already-recorded command) and split into short-time frames. The cepstral coefficient extraction stage returns the MFCC coefficients and the frame energy; these are further processed to derive the delta-cepstral and delta-delta-cepstral coefficients.

Figure 12 - Feature extraction process

Cepstral Coefficient Extraction

The block diagram of the cepstral coefficient extraction stage is shown below:

Figure 13 - Cepstral coefficient calculation from frames

As shown above, a fast Fourier transform is applied to each frame of the digitized command. The next step is to calculate the frame energy. The Fourier transform produces complex values; to make use of them, they must first be converted to real values. Taking the absolute value of each complex number in the array returns its magnitude as a real number, and these magnitudes are squared to compute the energy.
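A minimal sketch of this step for a single frame; the 256-sample length and the random stand-in frame are assumptions for illustration:

    % Frame energy from the FFT of one (windowed) frame, as described above.
    frame = randn(256, 1);          % stand-in for one windowed speech frame
    X = fft(frame);                 % complex spectrum of the frame
    magsq = abs(X).^2;              % squared magnitudes (real values)
    frameEnergy = log(sum(magsq));  % log of the summed energy, one feature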
The magnitudes of the fast Fourier transform are plotted in the figure below:

Figure 14 - FFT magnitude spectrum for a sample frame

The squared magnitudes are summed to form the frame energy, one of the parameters. They are also passed downstream to the Mel filter bank for further processing.

The Mel filter bank is a set of filters designed on the Mel frequency scale. The Mel scale is designed to represent the way humans perceive sound. The mapping between the Mel scale and linear frequency is shown below:

Figure 15 - Mel frequency vs. linear frequency

The Mel filter bank is pictured in the figure below:

Figure 16 - Mel filter bank magnitudes

After the signal is filtered by the Mel filter bank and the log of the result is taken, the discrete cosine transform (DCT) is applied. The result of the DCT is the set of Mel-frequency cepstral coefficients. The cepstral coefficients for a sample frame are shown below:

Figure 17 - Mel-frequency cepstral coefficients for a sample frame

Feature matrix and database formation

In this project there were 39 features per frame in total:

1. 12 MFCC coefficients
2. 1 frame energy
3. 13 delta-cepstral coefficients
4. 13 delta-delta-cepstral coefficients

The feature matrix is a 2D matrix with 39 rows, one per feature, and x columns, one per frame of the command. This feature matrix is then inserted into the database at an index that is mapped to the command it was extracted from. The database is thus a 3D matrix in which each 2D slice corresponds to one command. The block diagram below describes the structure of the database.

Figure 18 - Database structure

Repeating the feature extraction and storing process for all the command utterances, the final size of the database was 39 x 198 x 155, where:

- 39 is the number of features (rows)
- 198 is the maximum number of frames
- 155 is the number of commands

Database comparison

For any command input to the program, features are extracted in the same way as for the pre-recorded commands and stored in a feature matrix frame by frame. The resulting feature matrix is a 2D matrix of features. To detect the command that was uttered, this input feature matrix must be compared with the feature matrices inside the database. The algorithm used for the comparison was dynamic time warping (DTW), described in detail in the background section. The flow chart below describes the comparison algorithm:

Figure 19 - Database comparison algorithm flow chart

As shown in the diagram above, feature matrices are pulled from the database one at a time and compared with the input feature matrix; a sketch of this loop follows.
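A minimal sketch of the comparison loop, using the dtwDistance function sketched in Section 4; the zero-stripping step it includes is explained next, and the random stand-ins for the database and input are assumptions for illustration:

    % Compare the input feature matrix against every database entry and
    % keep the index of the closest match.
    db = randn(39, 198, 155);        % stand-in for the 39x198x155 database
    inputFeatures = randn(39, 120);  % stand-in for the input feature matrix
    bestDist = Inf;
    bestIdx = 0;
    for k = 1:size(db, 3)
        M = db(:, :, k);
        M = M(:, any(M ~= 0, 1));          % 'Strip Zeros': drop all-zero frames
        d = dtwDistance(M, inputFeatures); % cumulative DTW distance (Section 4)
        if d < bestDist
            bestDist = d;
            bestIdx = k;                   % this index maps to a command
        end
    end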
An important note when retrieving matrices from the database is to consider the zero frames. MATLAB assigns matrix dimensions dynamically, expanding a matrix to accommodate as many columns as required. This creates a problem when the entries differ in size: there is only one size for the database matrix, namely the size of the largest feature matrix. So what happens to the feature matrices whose dimensions are smaller than the database dimensions, and what values appear in their empty cells? The answer is zeros: MATLAB pads the empty cells of the smaller feature matrices with zeros. This padding causes several problems:

- It corrupts the feature matrix with false values
- It adds unnecessary computation
- It causes the comparison algorithm to yield incorrect matches

For good recognition, the zeros must be removed before comparing a stored feature matrix with the input feature matrix. The 'Strip Zeros' block in the algorithm does exactly that: it removes the zero columns from each feature matrix before passing it on to the comparison algorithm, so that the comparison yields reliable and accurate results. This was one of the main challenges we came across in the project; accuracy rates were very low until we corrected it, after which recognition started working very well.

Once a comparison is completed using DTW, the program returns a cumulative distance value, representative of the difference between the two matrices. These cumulative distances between the input features and each database entry are collected for all the feature matrices in the database. The minimum distance among all comparisons is then found, and the corresponding feature matrix is taken as the match. The code maps the index of that feature matrix to the command index, and the command is thereby identified. The serial port is then driven with the output associated with the command, which in turn controls the hardware. More details about the hardware interface can be found in the 'Hardware' section.

4.0 Dynamic Time Warping (DTW)

In our project, the speech signal is represented by a series of feature vectors computed every 10 ms. A whole word comprises dozens of these vectors, and the number of vectors (the duration) of a word depends on how fast the person is speaking. In speech recognition, we have to classify sequences of vectors: we need a way to compute a distance between an unknown sequence of vectors X and known sequences of vectors W, which are the prototypes for the words we want to recognize.

Distance between Two Sequences of Vectors

Classifying a spoken utterance would be easy if we had a good distance measure D(X, W) at hand. A good distance measure must:

- Measure the distance between two sequences of vectors of different length
- While computing the distance, find an optimal assignment between the individual feature vectors
- Compute a total distance as the sum of the distances between the individual pairs of feature vectors

Comparing the distance between two sequences of vectors of different length

In dynamic time warping, when sequences of different lengths are compared, a sequence is modified by repeating or omitting some frames so that both sequences have the same length, as shown in Figure 20 below. This modification of sequences is called time warping.

Figure 20 - Linear time warping [Ref: http://www.tik.ee.ethz.ch/~gerberm/PPS_Spracherkennung/SS06/lecture4.pdf]

As can be seen from Figure 20, the two sequences X and W consist of six and eight vectors, respectively. The sequence W has been rotated by 90 degrees, so that its time index runs from the bottom of the sequence to its top. The two sequences span a grid of possible assignments between the vectors, and each path through this grid (such as the one shown in the figure) represents one possible assignment of vector pairs.
For example, the first vector of X is assigned to the first vector of W, the second vector of X to the second vector of W, and so on. As an example, let us assume that a path P is given by the following sequence of time index pairs of the two vector sequences:

P = {(0,0), (1,1), (2,2), (3,2), (4,2), (5,3), (6,4), (7,4)}    (4.3)

The length of path P is determined by the maximum of the numbers of vectors contained in X and W. The assignment between the time indices of W and X given by P can be interpreted as "time warping" between the time axes of W and X. In our example, the vector at time index 2 of one sequence is assigned to the vectors at time indices 2, 3 and 4 of the other, warping its duration so that it lasts three time indices instead of one. By this kind of time warping, the different lengths of the vector sequences are compensated. For a given path P, the distance measure between the vector sequences can then be computed as the sum of the distances between the individually assigned vectors.

Finding the Optimal Path

Once we have the path, computing the distance becomes a simple task. The DTW distance can be computed efficiently using Bellman's principle of optimality, which states that if the optimal path from grid point A to grid point B passes through a grid point K, then the partial path from A to K is itself the optimal path from A to K. From this we can construct a way of iteratively finding the optimal path P, as shown in Figure 21.

Figure 21 - Nonlinear path options [Ref: http://www.tik.ee.ethz.ch/~gerberm/PPS_Spracherkennung/SS06/lecture5.pdf]

According to this principle, it is not necessary to compute all possible paths P and their corresponding distances to find the optimal path: of the huge number of theoretically possible paths, only a fraction is computed. To develop this further, we need the notion of local path alternatives, or local distances.

Local distances

Since both sequences consist of feature vectors measured over short time intervals, we can restrict the time warping to reasonable bounds. The first vectors of X and W should be assigned to each other, as should their last vectors. For the time indices in between, we want to avoid any big leap backward or forward in time; instead we restrict the warping to the reuse of the preceding vector(s), which locally warps the duration of a short segment of the speech signal. With these restrictions, we can draw a diagram of the possible local path alternatives for one grid point and its possible predecessors. A grid point (i, j) can have the following predecessors:

- (i-1, j): horizontal local path
- (i-1, j-1): diagonal local path
- (i, j-1): vertical local path

Figure 22 - Accumulated distance at point (i, j) [Ref: http://www.tik.ee.ethz.ch/~gerberm/PPS_Spracherkennung/SS06/lecture5.pdf]

All paths P that we consider as candidates for the optimal path can be constructed as concatenations of these local path alternatives. Accordingly, there are only three possible predecessor paths leading to a grid point (i, j): the partial paths from (0, 0) to the grid points (i-1, j), (i-1, j-1), and (i, j-1). The (globally) optimal path from (0, 0) to grid point (i, j) is found by selecting, among these alternatives, exactly the path hypothesis which minimizes the accumulated distance A(i, j) of the resulting path from (0, 0) to (i, j):

A(i, j) = d(i, j) + min{ A(i-1, j), A(i-1, j-1), A(i, j-1) }

where d(i, j) is the local distance between the i-th vector of one sequence and the j-th vector of the other.
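A minimal MATLAB sketch of this recurrence follows. The report does not state the exact local distance used, so the Euclidean distance between frame vectors is an illustrative assumption; note also that MATLAB indexing starts at 1, so grid point (0, 0) in the text is A(1, 1) here.

    % DTW cumulative distance between two feature matrices (features x frames),
    % using A(i,j) = d(i,j) + min of the three predecessors described above.
    function D = dtwDistance(W, X)
        nW = size(W, 2);                   % frames in the reference sequence
        nX = size(X, 2);                   % frames in the test sequence
        A = inf(nW, nX);                   % accumulated distances
        A(1, 1) = norm(W(:, 1) - X(:, 1)); % first vectors assigned to each other
        for i = 1:nW
            for j = 1:nX
                if i == 1 && j == 1, continue; end
                d = norm(W(:, i) - X(:, j));                   % local distance
                pred = inf;
                if i > 1,          pred = min(pred, A(i-1, j));   end  % horizontal
                if i > 1 && j > 1, pred = min(pred, A(i-1, j-1)); end  % diagonal
                if j > 1,          pred = min(pred, A(i, j-1));   end  % vertical
                A(i, j) = d + pred;
            end
        end
        D = A(nW, nX);                     % distance between the two sequences
    end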
Starting from grid point (0, 0), and using the vector distances defined at grid points (1, 0) and (0, 1), we can compute A(1, 0) and A(0, 1). We then look at the points which can be computed from the three points just finished: for each such point (i, j), we select the optimal predecessor out of the set of possible predecessors. In this way we walk through the matrix from bottom-left to top-right. Once we reach the top-right corner of the matrix, the accumulated distance A(RefFrames, TestFrames) is the distance D(W, X) between the two vector sequences.

5.0 Experiments & Results

In addition to the results noted on the presentation day, this section describes the planned recognition-rate tests for the team members and five external speakers with varied accents.

On the presentation day, the project had more success with the 'On' and 'Dark' commands; the other commands, 'Fifteen', 'Twenty' and 'Twentyfive', suffered in accuracy for external speakers. One reason for this large difference could be the length of the commands: 'On' and 'Dark' are short, so accent and speech variation affect them less, whereas for the longer commands pitch and accent come into play and accuracy suffers.

On the presentation day:

- We had close to 100% accuracy for the speakers the database was trained on
- We had about 80% accuracy across all speakers for the command 'On'
- We had about the same accuracy (80%) across all speakers for the command 'Dark'
- Many people did not try the other commands, but for those that did:
  - 'Fifteen' had about 60% accuracy
  - 'Twenty' had about the same (60%) hit rate
  - 'Twentyfive' was the worst, with a meagre hit rate of about 40%

For the report, it is planned to test the program as follows:

- 10 repeats of every command by each of the team members and, in addition, five external speakers
  - The hit rate will be treated as the accuracy
- With five commands, that is 50 repeats per speaker; for 10 speakers, the total is 500 test samples
- Results will be classified by command, by speaker type (external or in the database), and overall (all speakers)

6.0 Hardware

Figure 23 - Hardware schematic

Figure 24 - 5 V DC to 120 V AC wiring box schematic

Table 1 - Microcontroller signal routing table

Command      Device status     ASCII char   PB0 (relay)   PC0 PC1 PC2 (BCD chip 1)   PC3 PC4 PC5 (BCD chip 2)
Fifteen      LEDs display 15   c            -             1   0   0                  1   0   1
Twenty       LEDs display 20   w            -             0   1   0                  0   0   0
Twentyfive   LEDs display 25   h            -             0   1   0                  1   0   1
On           Light on          o            1             -   -   -                  -   -   -
Dark         Light off         f            0             -   -   -                  -   -   -

The microcontroller (Atmel) chip is programmed per the table above. Bit strings are sent from the DSP chip (or, during the demonstration, from the desktop computer through the serial port) as the ASCII characters shown in the table. When a specific character is recognised, for instance 'o', pin PB0 sends a true signal to turn on the relay, and thus the lamp; when 'f' is detected, the same pin sends a false signal to turn the lamp off. Six pins (PC0 to PC5) are routed to the two BCD drivers. When the character 'c', 'w', or 'h' is detected, the binary outputs operate as shown in the table to drive the two 7-segment LEDs to display the temperature '15', '20', or '25'.
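On the host side, the mapping from a recognized command to the character sent over the serial port follows directly from Table 1. A minimal MATLAB sketch, with variable names chosen for illustration and the serial port object s assumed already open as in Section 1:

    % Choose the ASCII character from Table 1 for the recognized command,
    % then write it to the open serial port object 's'.
    recognized = 'Fifteen';          % e.g. the output of the DTW comparison
    switch recognized
        case 'On',         c = 'o'; % relay on  -> lamp on
        case 'Dark',       c = 'f'; % relay off -> lamp off
        case 'Fifteen',    c = 'c'; % LEDs display 15
        case 'Twenty',     c = 'w'; % LEDs display 20
        case 'Twentyfive', c = 'h'; % LEDs display 25
    end
    fprintf(s, '%c', c);             % the microcontroller routes the rest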
7.0 Conclusion

The speech recognition was implemented in MATLAB using several extracted features: cepstrum, log energy, delta cepstrum, and delta-delta cepstrum numerical coefficients. Research was done on several algorithms, namely Neural Networks (NN), Hidden Markov Models (HMM), and Dynamic Time Warping (DTW), and the conclusion was to perform our speech recognition using Dynamic Time Warping. The DTW algorithm compares the features extracted from the live input signal against the features stored in the database; if a database match is found, the output interface is triggered to produce a serial bit string as output. Matched commands are described by ASCII characters, and the bit strings sent to the hardware interface are routed by the Atmel chip to the appropriate control signal. Five commands - On, Dark, Fifteen, Twenty, and Twentyfive - were recorded in different styles (high pitch, low pitch, slow, fast, and normal) by various speakers, male and female, with different accents.

Results on the presentation day were as follows:

- 100% accuracy for the speakers the database was trained on
- 80% accuracy for the command 'On'
- 80% accuracy for the command 'Dark'
- 60% accuracy for the command 'Fifteen'
- 60% accuracy for the command 'Twenty'
- 40% accuracy for the command 'Twentyfive'

However, these results are not final. Another recording session will be held soon to gather a larger database, with more speakers recording their commands, to increase the accuracy. For the final report, it is planned to test the program as follows:

- 10 repeats of every command by each of the team members and, in addition, five external speakers
  - The hit rate will be treated as the accuracy
- With five commands, that is 50 repeats per speaker; for 10 speakers, the total is 500 test samples

8.0 References

[1] Mohammed Waleed Kadous, "Dynamic Time Warping", http://www.cse.unsw.edu.au/~waleed/phd/html/node38.html (2002-12-10)
[2] Eamonn J. Keogh and Michael J. Pazzani, "Derivative Dynamic Time Warping", http://www.cs.ucr.edu/~eamonn/sdm01.pdf (2003-10-16)
[3] Prof. Ilya Pollak, "Speech Processing", http://cobweb.ecn.purdue.edu/~ipollak/ee438/FALL04/notes/Section2.2.pdf (2004-09-05)
[4] Michael Gerber, "PPS on Speech Recognition", http://www.tik.ee.ethz.ch/~gerberm/PPS_Spracherkennung/SS06 (2006-01-05)
[5] Joseph W. Picone, "Signal Modeling Techniques in Speech Recognition", Proceedings of the IEEE, Vol. 81 (1993-09-09)
[6] "Window Function", http://en.wikipedia.org/wiki/Hamming_window (2006-12-12)
[7] Minh N. Do, "An Automatic Speaker Recognition System"
[8] B. Plannerer, "An Introduction to Speech Recognition", http://www.speech-recognition.de/textbook.html (2003-05-18)
[9] Longbiao Wang, Norihide Kitaoka, Seiichi Nakagawa, "Robust Distant Speaker Recognition Based on Position Dependent Cepstral Mean Normalization", 2005
[10] "Dynamic Time Warping", http://en.wikipedia.org/wiki/Dynamic_time_warping (2006-12-16)
[11] Stuart N. Wrigley, "Speech Recognition by Dynamic Time Warping", http://www.dcs.shef.ac.uk/~stu/com326/sym.html
[12] "Formant", http://en.wikipedia.org/wiki/Formant (2006-12-16)
[13] "Signal Energy vs. Signal Power", http://cnx.org/content/m10055/latest/ (2004-08-12)
[14] Lawrence R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, 77 (2), pp. 257-286 (1989-02-09)