VOICE RECOGNITION SYSTEM IN NOISY ENVIRONMENT

Vinit D Patel
B.S., Parul Institute of Engineering and Technology, India, 2008

PROJECT

Submitted in partial satisfaction of the requirements for the degree of MASTER OF SCIENCE in ELECTRICAL AND ELECTRONIC ENGINEERING at CALIFORNIA STATE UNIVERSITY, SACRAMENTO

SPRING 2011

VOICE RECOGNITION SYSTEM IN NOISY ENVIRONMENT

A Project by Vinit D Patel

Approved by:

__________________________________, Committee Chair
Jing Pang, Ph.D.

__________________________________, Second Reader
Manish Gajjar, M.S., MBA

____________________________ Date

Name of Student: Vinit D Patel

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library, and that credit is to be awarded for the Project.

__________________________, Graduate Coordinator          ________________ Date
Preetham Kumar, Ph.D.
Department of Electrical and Electronic Engineering

Abstract of VOICE RECOGNITION SYSTEM IN NOISY ENVIRONMENT by Vinit D Patel

Human life could be much more comfortable if machines worked on voice commands, and many products on the market already provide an automated, voice-controlled experience. Voice recognition is a technology in which a machine responds to the human voice. The goal of this project is to design and implement a voice-controlled embedded system that works in a noisy environment. The project consists of a software design and a hardware design. The software part eliminates noise from the original signal and implements the speech recognition algorithm, Dynamic Time Warping (DTW); the hardware is used for speech recognition. In this project, the DTW algorithm was developed to study and research the implementation of single-word speech recognition. In addition, I implemented a nearest neighbor algorithm on top of DTW, which helps match speech across different people's accents. To obtain good results, I used a Wiener filter to increase the signal-to-noise ratio of the applied signal. The software portion is fully developed in the MATLAB environment.

On the hardware side, the work concentrated on comparing and processing the applied speech against pre-stored speech signals using the HM2007 chip. After processing, the output of the hardware controls the system according to the speech.

_______________________, Committee Chair          _______________________ Date
Jing Pang, Ph.D.

DEDICATION

I would like to dedicate this report to my mom. I have only been able to accomplish what I have thanks to her inspiration and sacrifice. She always said, "Work harder, God will be there to help you." She lived that out every day of her life.

Indira (aka Bharti) Patel (1963-2009)

ACKNOWLEDGMENTS

This space provides me with a great opportunity to thank all the people without whom this project would never have been possible, and I take this opportunity to convey my sincere thanks to all of them. First, I would like to thank Dr. Jing Pang, Associate Professor in the Department of Electrical and Electronic Engineering at California State University, Sacramento, for being my project guide, guiding me whenever needed, and always encouraging me to implement new things. She took time out of her busy schedule to provide me with valuable suggestions regarding the project as well as the project report.

I would like to thank my Dad, whose constant tinkering, obsessive organization, and ingenuity gave me the mind and attitude of a true engineer.
Special thanks to Mr. Manish Gajjar, Instructor in the Department of Electrical and Electronic Engineering at California State University, Sacramento, for being my second project guide. I would also like to thank Dr. Preetham Kumar, Graduate Coordinator of the Department of Electrical and Electronic Engineering at California State University, Sacramento, for providing me this wonderful opportunity of gaining the best knowledge.

I also want to thank all my friends, without whom I would never have been able to complete my project on time. They were always there to help me and guide me whenever I had any doubts, and their constant support and encouragement helped me work even harder.

TABLE OF CONTENTS

Dedication
Acknowledgments
List of Tables
List of Figures

Chapter
1. INTRODUCTION
   1.1 Overview
   1.2 Purpose of Project
   1.3 Applications of Voice Recognition Embedded System
2. SPEECH RECOGNITION
   2.1 Classification of Speech Recognition System
   2.2 Basic Composition of Speech Recognition System
   2.3 Software Results
3. WIENER FILTER
   3.1 Introduction
   3.2 Block Diagram and Mathematical Solution
   3.3 Wiener Filter Input-Output Relationships
   3.4 The Wiener-Hopf Technique
   3.5 Wiener Filter Solution
   3.6 Simulation Results
4. DYNAMIC TIME WARPING
   4.1 Overview
   4.2 DTW Algorithm
   4.3 Mel-Frequency Cepstrum Coefficients
   4.4 Applications
   4.5 Distance Matrix Coefficient Result
5. K-NEAREST NEIGHBOR ALGORITHM
   5.1 Introduction
   5.2 Assumptions in KNN
   5.3 KNN for Density Estimation
   5.4 KNN Classification
   5.5 Implementation of KNN
   5.6 Results
6. HARDWARE IMPLEMENTATION
   6.1 Overview
   6.2 Block Diagram of Hardware
   6.3 Training and Working of HM-2007
   6.4 Microcontroller Interfacing with Motors
   6.5 Limitation of Hardware
7. CONCLUSION
Appendix A Software Code
Appendix B Hardware Code
Bibliography

LIST OF TABLES

Table 1 List of 13 Speech Enhancement Algorithms Evaluated
Table 2 Component List

LIST OF FIGURES

Figure 1 Speech Recognition Flowchart
Figure 2 Basic Diagram and Design of Project
Figure 3 Wiener Filter Block Diagram
Figure 4 W Matrix Calculation for Single Microphone Case
Figure 5 Simulation Result of Wiener Filter
Figure 6 DTW Algorithm to Search Path
Figure 7 Euclidean Distance between Two Vectors Xr and Xs
Figure 8 Circuit Schematic of SR-07
Figure 9 Block Diagram of Hardware
Figure 10 Simple Darlington Transistor
Figure 11 Microcontroller Interfacing Circuit
Figure 12 Pictures of Whole Project
Chapter 1
INTRODUCTION

The theme of social interaction and intelligence is important and interesting to the Artificial Intelligence and Robotics community [1]. It is one of the challenging areas of Human-Robot Interaction. Speech recognition technology is a great aid in meeting this challenge, and it is a prominent technology for the Human-Computer Interaction of the future. Humans have five classical senses, vision, touch, smell, taste and hearing, by which they perceive the surrounding world. The main goal of this project is to apply the hearing sense and speech analysis to an embedded system.

1.1 Overview

A voice recognition embedded system is an advanced control system that uses human voice/audio speech to identify a speech command; the system then performs the action corresponding to the given command. Today, many speech recognition systems and networks are available on the market, and many companies have also built speech recognition microprocessors for use in such systems and networks. These kinds of integrated circuits use pre-recorded speech signals as references to recognize the current speech command. Such a system performs differently in different environments, noisy and non-noisy. For an efficient design, we have to build the system for the worst condition, which occurs in a noisy environment. For such an efficient system, we can remove the noise from the speech command before applying it to the speech recognition system.

1.2 Purpose of Project

In this project, as part of the software, I developed a nearest neighbor algorithm for speech recognition that follows a Dynamic Time Warping (DTW) calculation. Still, something was missing: the project worked perfectly in a non-noisy environment but had issues in a noisy one. I therefore researched noise reduction techniques for speech signals and found a good paper comparing a variety of speech enhancement algorithms [2].

Algorithm      Equation/Parameters                  Ref
KLT            Eq. 14, 48                           [8]
pKLT           Eq. 34, v = 0.08                     [9]
MMSE-SPU       Eq. 7, 51, q = 0.3                   [10]
logMMSE        Eq. 20                               [11]
logMMSE-ne     Eq. 20                               [11]
logMMSE-SPU    Eq. 2, 8, 10, 16                     [12]
pMMSE          Eq. 12                               [13]
RDC            Eq. 6, 7, 10, 14, 15                 [14]
RDC-ne         Eq. 6, 7, 10, 14, 15                 [14]
MB             Eq. 4-7                              [15]
WavThr         Eq. 11, 25                           [16]
Wiener_as      Eq. 3-7                              [4]
AudSup         Eq. 26, 38, v(i)=1, 2 iterations     [17]

Table 1: List of 13 Speech Enhancement Algorithms Evaluated [2]
(The equation numbers and bracketed references in Table 1 are those of the survey paper [2].)

After studying all of them, I decided to work with the Wiener filter, which is easy to implement and is, on average, effective against all kinds of noise, such as car noise, street noise, and babble noise. The filtered input goes to the DTW calculation, which helps find the nearest neighbor of the spoken word. I implemented the whole algorithm in MATLAB.
For the hardware part, I worked with one of the speech recognition microprocessors, the HM2007, to recognize the speech command. I applied the speech signal through a microphone to the HM2007, which compared the command with pre-recorded speech and gave the output accordingly. An AT89C52 microcontroller processed the output signal of the HM2007 and controlled the motors of the embedded system.

1.3 Applications of Voice Recognition Embedded System

The following are applications of the system under discussion:
1. Development of educational games and smart toys
2. Keyless operation of devices such as personal computers and laptops, automobiles, cell phones, door locks, smart card applications, ATMs, etc.
3. Support for disabled people
4. Alert/warning signals during emergencies in airplanes, trains and/or buses
5. Automatic payment and customer service support over the telephone

Chapter 2
SPEECH RECOGNITION

Speech recognition is important for a machine to understand human voices and perform actions according to human commands. Speech recognition is a heavily researched subject and is useful in the area of pattern recognition, involving physiology, psychology, linguistics, computer science, signal processing, and many other fields, even body language (as when people use expressions, gestures and other actions while speaking to help each other understand).

2.1 Classification of Speech Recognition System

A speech recognition system, depending on the point of view and the scope of the application, has different performance requirements for its design. Implementations are of the following types:
1) Isolated-word, connected-word, continuous-speech, and conversational speech-understanding systems
2) Large-vocabulary and small-vocabulary systems
3) Speaker-dependent and speaker-independent speech recognition systems

2.2 Basic Composition of Speech Recognition System

A typical speech recognition flow is shown in Figure 1. The input analog voice signal first goes through preprocessing, which includes pre-filtering, sampling, quantization, windowing, endpoint detection, pre-emphasis and so on. The next important part is feature extraction. The requirements on the characteristic parameters are:
1) The extracted characteristic parameters should be representative of the voice
2) The parameters of different orders should be well decorrelated (independent)
3) The characteristic parameters should be easy to calculate, choosing an efficient method to ensure real-time implementation

[Figure 1: Speech Recognition Flowchart. Stages: enter speech; Wiener filter (eliminate noise); mel-frequency cepstrum coefficient calculation; comparison of the entered speech against reference MAT files created from pre-recorded speech signals; result display.]

In this project, speech recognition software was developed using the Wiener filter, DTW and KNN algorithms. The Wiener filter takes the input signal from the microphone; it helps remove noise from the original signal and increases the signal-to-noise ratio. The filtered signal is then applied to MFCC processing to create the mel-frequency cepstrum coefficients. First, reference files were created for the various pre-recorded speech signals. When the microphone input signal is applied, its MFC coefficients are compared to the pre-recorded speech's MFC coefficients using the DTW algorithm. The output scores of DTW are applied to the KNN algorithm to find the nearest common sound among the five different recorded speech signals. At the end, the software output is displayed on the MATLAB output screen: the software displays the correct speech command when the applied microphone signal matches the pre-recorded signals.
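The flow described above condenses into a few MATLAB calls. The sketch below is a minimal outline that mirrors Recognition_Main in Appendix A; WienerFilter, mfcc and DTWCalc are the helper functions listed there, and the parameter values follow that listing.

fs = 16000; nc = 13; N = 5; k = 3;          % sampling rate, MFCC count, vocabulary size, neighbors
speechIn = wavrecord(2*fs, fs);             % record two seconds of speech
speechIn = speechIn - mean(speechIn);       % DC offset elimination
speechIn = WienerFilter(speechIn);          % noise reduction (Chapter 3)
rMatrix  = mfcc(nc, speechIn, fs);          % MFCC feature matrix (Section 4.3)
scores   = DTWCalc(rMatrix, N);             % DTW distance to every stored template
[sorted, idx] = sort(scores);               % ascending: best matches first
word = mode(mod(idx(1:k)-1, N) + 1);        % majority vote among the k nearest (Chapter 5)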
2.3 Software Results

The following cases are MATLAB test results of the software:

Case 1:
Press any key to start 2 seconds of speech recording...Recording speech...Finished recording.
System is trying to recognize what you have spoken...
No microphone connected or you have not said anything.

Case 2:
Press any key to start 2 seconds of speech recording...Recording speech...Finished recording.
System is trying to recognize what you have spoken...
You just said Forward.

Chapter 3
WIENER FILTER

3.1 Introduction

A practical fact of signal processing is that it often proves difficult to extract a signal from noise. Therefore, we need a filter with so-called optimal linear filter characteristics [7], such that when both signal and noise are applied to it, the output reproduces the signal as accurately as possible with maximum noise suppression. Wiener filtering is used to solve this class of problems of extracting a signal from noise. The Wiener filter was introduced by Norbert Wiener in 1949 [3]; it was also introduced independently for the discrete-time case by Kolmogorov [4]. Wiener-Kolmogorov filters rest on the following assumptions: (a) signal and (additive) noise are stochastic processes with known spectral characteristics, or known autocorrelation and cross-correlation; and (b) the performance criterion is minimum mean-square error. The optimal filter can be found from a solution based on scalar or multivariable methods [5]. The goal of the Wiener filter is to filter out, by statistical means, noise that has corrupted the signal [6].

3.2 Block Diagram and Mathematical Solution

Consider a linear system whose unit-sample response is $h(n)$, driven by a random input signal

$$x(n) = s(n) + v(n) \quad (3.1)$$

where $s(n)$ is the signal and $v(n)$ is the noise in the signal. The output $y(n)$ of the linear system is

$$y(n) = \sum_{m} h(m)\,x(n-m) \quad (3.2)$$

The aim of the project is to apply the original noisy signal $x(n)$ to a linear system $h(n)$ such that the received output $y(n)$ is as close as possible to the desired signal $s(n)$; that is, $y(n)$ is the estimated value $\hat{s}(n)$:

$$y(n) = \hat{s}(n) \quad (3.3)$$

[Figure 2: Basic Diagram and Design of Project. The input $x(n) = s(n) + v(n)$ enters the linear system $h(n)$, which outputs $y(n) = \hat{s}(n)$.]

A Wiener filter block diagram as described by Haykin [8] is shown in Figure 3.

[Figure 3: Wiener Filter Block Diagram. The M-dimensional input vector y passes through the M x M filter matrix W to give the filter output z, which is subtracted from the desired signal d to give the error e.]

From Figure 3, we can see that the microphone signal $y$ is applied to the filter, and the filter output $z$ is nearly the desired signal. The filtered signal $z$ still has some residual error,

$$e = d - z = d - Wy \quad (3.4)$$

Figure 3 shows that when $z = d$, then $e = 0$; in other words, when $e = 0$, $z$ is exactly the estimate of $d$. Therefore, when a desired signal arrives corrupted by noise, a properly selected $W$ matrix is available for estimating the desired signal. The goal of the speech enhancement part of this project is thus to compute the $W$ matrix [7].
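Before deriving the general M-dimensional solution in the next section, a toy single-tap (M = 1) MATLAB illustration of Figure 3 may help; the signals and the noise level here are arbitrary assumptions. The scalar weight w plays the role of the W matrix, and the residual power of e = d - w*y drops below the raw noise power.

n = 10000;
d = randn(n,1);                          % desired signal (assumed white for the example)
y = d + 0.5*randn(n,1);                  % observed signal: desired signal plus noise
w = (d'*y)/(y'*y);                       % sample estimate of the optimal scalar weight
z = w*y;                                 % filter output
e = d - z;                               % residual error, Equation (3.4) with M = 1
fprintf('noise power %.3f, residual power %.3f\n', mean((y-d).^2), mean(e.^2));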
3.3 Wiener Filter Input-Output Relationships

As shown in Figure 2, the linear system $h(n)$ acts as an estimator of $s(n)$. Let $\hat{s}(n)$ and $s(n)$ denote the estimated value and the true value respectively, and let $e(n)$ be the error between them. The error $e(n)$ may be positive or negative, and it is a random variable; therefore it is reasonable to express the error by its mean-square value. The statistical criterion of minimum mean-square error is

$$E[e^2(n)] = E[(s - \hat{s})^2] \quad (3.5)$$

We know the output of the filter is

$$y(n) = \hat{s}(n) = \sum_{m=0}^{N-1} h(m)\,x(n-m) \quad (3.6)$$

The error is

$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{m=0}^{N-1} h(m)\,x(n-m) \quad (3.7)$$

So the mean-square error is

$$E[e^2(n)] = E\left[\Big(s(n) - \sum_{m=0}^{N-1} h(m)\,x(n-m)\Big)^2\right] \quad (3.8)$$

Setting the derivatives with respect to the filter impulse response $h(m)$, $m = 0, 1, \dots, N-1$, to zero gives

$$-2\,E\left[\Big(s(n) - \sum_{m=0}^{N-1} h_{opt}(m)\,x(n-m)\Big)\,x(n-j)\right] = 0, \quad j = 0, 1, \dots, N-1 \quad (3.9)$$

and further

$$E[s(n)\,x(n-j)] = \sum_{m=0}^{N-1} h_{opt}(m)\,E[x(n-m)\,x(n-j)], \quad j = 0, 1, \dots, N-1 \quad (3.10)$$

For the calculation of the W matrix of Figure 3, we can take reference from the following figure:

[Figure 4: W Matrix Calculation for Single Microphone Case [7]. The input signal y passes through the M x M filter matrix W to give the output z; the correlations Rxx and Rxs that determine W are estimated with the help of a voice activity detector (VAD).]

Thus:

$$R_{xs}(j) = \sum_{m=0}^{N-1} h_{opt}(m)\,R_{xx}(j-m), \quad j = 0, 1, 2, \dots, N-1 \quad (3.11)$$

Therefore, we get the linear equations

$$
\begin{aligned}
j = 0:\quad & R_{xs}(0) = h(0)R_{xx}(0) + h(1)R_{xx}(1) + \dots + h(N-1)R_{xx}(N-1)\\
j = 1:\quad & R_{xs}(1) = h(0)R_{xx}(1) + h(1)R_{xx}(0) + \dots + h(N-1)R_{xx}(N-2)\\
& \vdots\\
j = N-1:\quad & R_{xs}(N-1) = h(0)R_{xx}(N-1) + h(1)R_{xx}(N-2) + \dots + h(N-1)R_{xx}(0)
\end{aligned} \quad (3.12)
$$

Written in matrix form:

$$
\begin{bmatrix}
R_{xx}(0) & R_{xx}(1) & \cdots & R_{xx}(N-1)\\
R_{xx}(1) & R_{xx}(0) & \cdots & R_{xx}(N-2)\\
\vdots & \vdots & & \vdots\\
R_{xx}(N-1) & R_{xx}(N-2) & \cdots & R_{xx}(0)
\end{bmatrix}
\begin{bmatrix} h(0)\\ h(1)\\ \vdots\\ h(N-1) \end{bmatrix}
=
\begin{bmatrix} R_{xs}(0)\\ R_{xs}(1)\\ \vdots\\ R_{xs}(N-1) \end{bmatrix} \quad (3.13)
$$

The simplified form of the matrix equation is

$$R_{xx} H = R_{xs} \quad (3.14)$$

where $H = [h(0)\; h(1)\; \dots\; h(N-1)]'$ holds the filter coefficients, $R_{xs} = [R_{xs}(0),\, R_{xs}(1),\, \dots,\, R_{xs}(N-1)]'$ is the cross-correlation vector of the input with the desired signal, and $R_{xx}$ is the autocorrelation matrix shown in (3.13).

We can see that the design process of the Wiener filter seeks, under the minimum mean-square error criterion, the unit impulse response of the filter, or an expression for its transfer function. In essence this is a Wiener-Hopf equation. In addition, the Wiener filter design requires the correlation functions of the signal and the noise.
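Because the matrix in (3.13) is Toeplitz, equation (3.14) can also be solved directly in MATLAB, as the short cross-check below shows; the correlation values here are made-up numbers. Appendix A solves the same system with the recursive routine wienersolution instead, which avoids forming the full matrix.

rxx = [1.0 0.6 0.3 0.1];                 % assumed autocorrelation values Rxx(0..N-1)
rxs = [0.9 0.5 0.2 0.05];                % assumed cross-correlation values Rxs(0..N-1)
Rxx = toeplitz(rxx);                     % symmetric Toeplitz autocorrelation matrix of (3.13)
H   = Rxx \ rxs';                        % filter coefficients h(0)..h(N-1) from (3.14)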
3.4 The Wiener-Hopf Technique

In 1930, Eberhard Hopf joined the Department of Mathematics at the Massachusetts Institute of Technology on a temporary contract, with the help of Norbert Wiener. The collaboration between Wiener and Hopf was initiated by their mutual interest in the differential equations governing the radiative equilibrium of stars [9]. In Wiener's own words [10], "The various types of particle which form light and matter exist in a sort of balance with one another, which changes abruptly when we pass beyond the surface of the star. It is easy to set up the equations for this equilibrium, but it is not easy to find a general method for the solution of these equations." Their collaboration in research resulted in the Wiener-Hopf technique as a means to solve

$$\int_{0}^{\infty} k(x-y)\,f(y)\,dy = g(x), \quad 0 < x < \infty \quad (3.15)$$

The method proceeds by extending the domain of, or continuing, the integral equation (3.15) to negative real values of $x$:

$$\int_{0}^{\infty} k(x-y)\,f(y)\,dy = \begin{cases} g(x), & 0 < x,\\ h(x), & x < 0, \end{cases} \quad (3.16)$$

where $h(x)$ is unknown. The actual solution and full details can be found in the textbook by Noble [11], but Fourier transformation of (3.15) then yields the typical Wiener-Hopf functional equation

$$G(\alpha) + H(\alpha) = F(\alpha)\,K(\alpha) \quad (3.17)$$

in which $H(\alpha)$ and $F(\alpha)$ are half-range Fourier transforms of the unknown functions $h(x)$ and $f(x)$ respectively. By contrast, $G(\alpha)$ and $K(\alpha)$ are half-range Fourier transforms of the known functions $g(x)$ and $k(x)$.

3.5 Wiener Filter Solution

The key to the solution lies in the known input signal, its autocorrelation function, and its cross-correlation with the desired signal. By solving the Wiener-Hopf equation (3.14) for these quantities, we obtain the Wiener filter. The basic solution steps, implemented as the function wienersolution in Appendix A (a worked example follows the list; the garbled source has been reconstructed here to match that listing), are:

1. Initialize the values by:
   a. $a(0) = r_{xd}(0)/r_{xx}(0)$
   b. $b(0) = r_{xd}(1)/r_{xx}(0)$

2. For $j = 1, 2, \dots, M-1$, make the following calculation:
   a. $\mathrm{temp1} = \Big(r_{xd}(j) - \sum_{i=0}^{j-1} r_{xx}(j-i)\,a(i)\Big) \Big/ \Big(r_{xx}(0) - \sum_{i=0}^{j-1} r_{xx}(j-i)\,b(i)\Big)$
   b. $a(i) \leftarrow a(i) - \mathrm{temp1}\cdot b(i)$, for $i = 0, 1, \dots, j-1$
   c. $a(j) = \mathrm{temp1}$
   d. $\mathrm{temp2} = \Big(r_{xx}(j+1) - \sum_{i=0}^{j-1} r_{xx}(i+1)\,b(i)\Big) \Big/ \Big(r_{xx}(0) - \sum_{i=0}^{j-1} r_{xx}(j-i)\,b(i)\Big)$
   e. $b(i) \leftarrow b(i-1) - \mathrm{temp2}\cdot b(j-i)$, for $i = 1, \dots, j$
   f. $b(0) = \mathrm{temp2}$

3. The filter response is $h(j) = a(j)$, $j = 0, 1, \dots, M-1$, where the final coefficient $a(M-1)$ is obtained by carrying out steps 2a-2c once more with $j = M-1$.

4. Convolving these coefficients with the applied input signal gives the Wiener filter output signal.
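This recursion is exactly what wienersolution in Appendix A computes. The worked example below is the one quoted in that listing's own comments:

A = [6 5 4 3 2 1];                       % autocorrelation vector Rxx(0..M-1)
B = [100 90 120 50 80 200];              % cross-correlation vector Rxd(0..M-1)
M = 6;                                   % filter length
h = wienersolution(A, B, M);
% h = [26.4286 -20.0000 50.0000 -50.0000 -45.0000 81.4286]'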
3.6 Simulation Results

[Figure 5: Simulation Result of Wiener Filter]

From Figure 5, we can see that the applied signal has low amplitude with higher noise, while the Wiener filter output signal has more strength compared to the noise. This criterion helps increase the signal-to-noise ratio of the applied signal. Finally, the Wiener filter output played through a speaker sounds clearer than the original noisy signal.

Chapter 4
DYNAMIC TIME WARPING

4.1 Overview

In this type of speech recognition technique, the test data is converted to templates. The recognition process then consists of matching the incoming speech with the stored templates; the template with the lowest distance measure from the input pattern is the recognized word. The best match (lowest distance measure) is found by dynamic programming. This is called a Dynamic Time Warping (DTW) word recognizer. In order to understand DTW, two concepts need to be dealt with:

- Features: the information in each signal has to be represented in some manner.
- Distances: some form of metric has to be used in order to obtain a match path. There are two types:
  o Local: a computational difference between a feature of one signal and a feature of the other.
  o Global: the overall computational difference between an entire signal and another signal of possibly different length.

Since the feature vectors can have multiple elements, a means of calculating the local distance is required. The distance measure between two feature vectors is calculated using the Euclidean distance metric. Therefore the local distance between feature vector $x$ of signal 1 and feature vector $y$ of signal 2 is given by

$$d(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2} \quad (4.1)$$

4.2 DTW Algorithm

Speech is a time-dependent process. Hence, utterances of the same word will have different durations, and utterances of the same word with the same duration will differ in the middle, due to different parts of the word being spoken at different rates. To obtain a global distance between two speech patterns (represented as sequences of vectors) a time alignment must be performed. This problem is illustrated in Figure 6, in which a "time-time" matrix is used to visualize the alignment. As with all the time alignment examples, the reference pattern (template) goes up the side and the input pattern goes along the bottom. In this illustration the input "SsPEEhH" is a noisy version of the template "SPEECH". The idea is that 'h' is a closer match to 'H' than to anything else in the template. The input "SsPEEhH" will be matched against all templates in the system's repository. The best matching template is the one for which there is the lowest-distance path aligning the input pattern to the template. A simple global distance score for a path is simply the sum of the local distances that make up the path.

[Figure 6: DTW Algorithm to Search Path]

To make the algorithm efficient and avoid excessive computation, we apply certain restrictions on the direction of propagation. The constraints are given below:

- Matching paths cannot go backwards in time.
- Every frame in the input must be used in a matching path.
- Local distance scores are combined by adding to give a global distance.

This algorithm is known as Dynamic Programming (DP). When applied to template-based speech recognition, it is often referred to as Dynamic Time Warping (DTW). DP is guaranteed to find the lowest-distance path through the matrix while minimizing the amount of computation. The DP algorithm operates in a time-synchronous manner: each column of the time-time matrix is considered in succession (equivalent to processing the input frame by frame), so that, for a template of length N, the maximum number of paths being considered at any time is N. If $D(i,j)$ is the global distance up to $(i,j)$ and the local distance at $(i,j)$ is given by $d(i,j)$, then

$$D(i,j) = \min\big[D(i-1,j-1),\; D(i-1,j),\; D(i,j-1)\big] + d(i,j) \quad (4.2)$$

Given that $D(1,1) = d(1,1)$ (the initial condition), we have the basis for an efficient recursive algorithm for computing $D(i,j)$. The final global distance $D(n,N)$ gives us the overall matching score of the template with the input. The input word is then recognized as the word corresponding to the template with the lowest matching score.
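A toy run of the DTW function from Appendix A illustrates the recursion (4.2). Each row of a matrix is one frame's feature vector; one-dimensional "features" keep the example small enough to check by hand.

F = [1; 2; 3; 3];                        % test pattern, four frames
R = [1; 3; 3];                           % reference template, three frames
cost = DTW(F, R);                        % returns 1: the only cost is aligning the middle frame '2'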
4.3 Mel-Frequency Cepstrum Coefficients

In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. Mel-frequency cepstral coefficients (MFCCs) are the coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum. This frequency warping can allow for better representation of sound, for example, in audio compression.

MFCCs are commonly derived as follows [12] (a MATLAB sketch of these steps follows the list):
1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.
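The following is a minimal sketch of the five steps above for a single frame. The triangular filterbank here is a simplified stand-in for the VOICEBOX melbankm routine used by the mfcc function in Appendix A; the frame, FFT size and filter count are assumptions matching that listing.

fs = 16000; n = 512; fn = 24; nc = 13;   % sampling rate, FFT size, mel filters, MFCCs
frame = randn(n,1);                      % stand-in for one windowed speech frame
P = abs(fft(frame, n)).^2;               % step 1: power spectrum of the windowed frame
P = P(1:n/2+1);                          % keep the bins from 0 to fs/2
mel  = @(f) 2595*log10(1 + f/700);       % Hz-to-mel mapping
imel = @(m) 700*(10.^(m/2595) - 1);      % mel-to-Hz mapping
edges = imel(linspace(mel(0), mel(fs/2), fn+2)); % fn+2 band edges, equally spaced in mel
fHz = (0:n/2)*fs/n;                      % FFT bin frequencies in Hz
H = zeros(fn, n/2+1);
for i = 1:fn                             % step 2: triangular overlapping windows
    lo = edges(i); c = edges(i+1); hi = edges(i+2);
    H(i,:) = max(0, min((fHz-lo)/(c-lo), (hi-fHz)/(hi-c)));
end
melLog = log10(max(H*P, eps));           % step 3: log of the power in each mel band
cep = dct(melLog);                       % step 4: DCT of the mel log powers
mfccs = cep(1:nc);                       % step 5: keep the first nc amplitudes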
4.4 Applications

Using the above algorithm, we generate templates for the training data set. This set includes 5 utterances for each of the five command words: "Left", "Right", "Stop", "Forward" and "Backward". After feature extraction is done on the test word, it is matched with all the templates and the minimum distance is calculated for each. The word is classified as one of the five commands depending on the global minimum distance. In this project, I developed the algorithm to control the two motors of a toy car.

4.5 Distance Matrix Coefficient Result

Scores1 = 1.0e+004 *

0.5741  1.8575  3.7164  2.7567  3.2732  2.4946  2.1208  1.8565  3.4530  1.7783
1.6183  0.6201  5.4711  4.0024  4.5169  3.8422  3.025   1.439   5.5660  3.4090
3.5879  5.2763  0.4147  1.4435  2.4924  3.1520  2.5371  3.5035  1.0029  2.9692
2.2657  3.6250  1.0511  0.5487  2.1223  2.2573  1.8405  2.5638  1.0303  2.0788
2.4052  3.9130  2.1418  1.8048  0.4234  2.9629  2.6261  2.5600  2.1139  2.2589
2.2896  3.2425  3.0996  2.7853  3.0529  0.5002  2.2229  3.2317  2.7873  1.9719
2.1692  3.8311  3.1997  2.6173  2.9306  2.4412  0.6207  2.9675  2.6489  1.5591
1.9428  2.2470  4.0326  2.6934  3.1362  3.6682  3.4982  0.7567  4.3060  2.6786
3.2084  4.7374  1.0319  1.4739  2.2992  2.9104  2.2509  3.8464  0.4946  2.5945
2.2231  4.4386  2.9113  2.5731  2.7994  2.2883  1.3779  3.6557  2.2412  0.3803

From this score result the system determines the matching word of the speech recognition, and the recognized sound can be played back. The particular result above shows the scores generated for the "Left" command. In this project, scores of this type for different people's speech are collected into a vector called Allscores; the indices of that vector are applied to the KNN algorithm to calculate the final output index.

Chapter 5
K-NEAREST NEIGHBOR ALGORITHM

5.1 Introduction

K-Nearest Neighbor (KNN) is one of those algorithms that are very simple to understand but work incredibly well in practice. In addition, it is surprisingly versatile, and its applications range from vision to proteins to computational geometry to graphs. Many people learn the algorithm and then do not use it much, which is a pity, since a clever use of KNN can make things very simple. It may also surprise many to know that KNN is one of the top 10 data mining algorithms.

KNN is a non-parametric, lazy learning algorithm. That is a concise statement worth unpacking. Non-parametric methods are useful because, in the real world, most practical data does not obey the typical theoretical assumptions (e.g., Gaussian mixtures, linear separability); non-parametric algorithms like KNN come to the rescue here. KNN is also a lazy algorithm: it does not use the training data points to do any generalization. In other words, there is no explicit training phase, or it is minimal, which means the training phase is fast. The lack of generalization means that KNN keeps all the training data; more exactly, all the training data (or, in the best case, a subset of it) is needed during the testing phase. The dichotomy is obvious here: a nonexistent or minimal training phase but a costly testing phase, with the cost in terms of both time and memory. More time may be needed because, in the worst case, all data points take part in the decision; more memory is needed because we store all the training data.

5.2 Assumptions in KNN

KNN assumes that the data is in a feature space. More exactly, the data points are in a metric space. The data can be scalars or possibly even multidimensional vectors. Since the points are in a feature space, they have a notion of distance. This need not necessarily be the Euclidean distance, although it is the one commonly used:

[Figure 7: Euclidean Distance between Two Vectors Xr and Xs [13]]

$$d(X_r, X_s) = \lVert X_r - X_s \rVert = \sqrt{(X_{r1} - X_{s1})^2 + (X_{r2} - X_{s2})^2} \quad (5.1)$$

Each of the training data points consists of a vector and the class label associated with it. In the simplest case, the label will be either + or - (for positive or negative classes), but KNN can work equally well with an arbitrary number of classes. We are also given a single number "k", which decides how many neighbors (where neighbors are defined by the distance metric) influence the classification. This is usually an odd number when the number of classes is two. If k = 1, the algorithm is simply called the nearest neighbor algorithm.

5.3 KNN for Density Estimation

Although classification remains the primary application of KNN, we can use it for density estimation as well. Since KNN is non-parametric, it can estimate arbitrary distributions. The idea is very similar to the use of a Parzen window; instead of using a hypercube with kernel functions, we do the estimation as follows: to estimate the density at a point x, place a hypercube centered at x and keep increasing its size until k neighbors are captured. Now estimate the density using the formula

$$p(x) = \frac{k/n}{V} \quad (5.2)$$

where n is the total number of data points and V is the volume of the hypercube. Notice that the numerator is essentially constant, so the volume determines the density. The intuition is this: suppose the density at x is very high; then we can find k points near x very quickly, and these points are also very close to x (by the definition of high density), so the volume of the hypercube is small and the resulting density estimate is high. Suppose instead the density around x is very low; then the volume of the hypercube needed to encompass the k nearest neighbors is large, and consequently the ratio is low. The volume performs a job similar to the bandwidth parameter in kernel density estimation. In fact, KNN is one of the common methods of estimating the bandwidth (e.g., adaptive mean shift).
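A minimal one-dimensional MATLAB sketch of equation (5.2), with assumed data, follows: the "hypercube" around x is simply an interval whose half-width is the distance to the k-th nearest sample.

n = 1000; k = 25;
data = randn(n,1);                       % samples from an unknown density
x = 0.5;                                 % point at which to estimate p(x)
dists = sort(abs(data - x));             % distances from x to every sample
V = 2*dists(k);                          % volume (length) just enclosing k neighbors
p = (k/n)/V;                             % Equation (5.2); about 0.35 for standard normal data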
5.4 KNN Classification

In this case, we are given some data points for training and new unlabeled data for testing. Our aim is to find the class label for the new point. The algorithm behaves differently based on k.

Case 1: k = 1, or the Nearest Neighbor Rule

This is the simplest scenario. Let x be the point to be labeled, and let y be the point closest to x. The nearest neighbor rule assigns the label of y to x. This seems too simplistic and sometimes even counterintuitive. If you feel that this procedure will result in a huge error, you are right, but there is a catch: this reasoning holds only when the number of data points is not very large. If the number of data points is very large, then there is a very high chance that the labels of x and y are the same. An example may help: suppose you have a (potentially) biased coin; you toss it a million times and it comes up heads 900,000 times; then most likely your next call will be heads.

Now, assume all points are in a D-dimensional plane, with a reasonably large number of points. This means that the density of the plane at any point is high; in other words, within any subspace there is an adequate number of points. Consider a point x in the subspace, which also has many neighbors, and let y be its nearest neighbor. If x and y are sufficiently close, then we can assume that the probability that x belongs to a given class is the same as the probability that y does; then, by decision theory, x and y have the same class.

The book "Pattern Classification" by Duda and Hart has an excellent discussion of this nearest neighbor rule. One of their striking results is a tight error bound for the nearest neighbor rule:

$$P^{*} \le P \le P^{*}\Big(2 - \frac{c}{c-1}\,P^{*}\Big) \quad (5.3)$$

where $P^{*}$ is the Bayes error rate, c is the number of classes, and P is the error rate of the nearest neighbor rule. The result is indeed very striking (at least to me), because it says that if the number of points is large, then the error rate of the nearest neighbor rule is less than twice the Bayes error rate.

Case 2: k = K, or the k-Nearest Neighbor Rule

This is a straightforward extension of 1NN: we find the k nearest neighbors and take a majority vote. Typically, k is odd when the number of classes is 2. Say k = 5, and there are 3 instances of C1 and 2 instances of C2; in this case, KNN labels the new point C1, since it forms the majority. We follow a similar argument when there are multiple classes.

One straightforward extension is not to give one vote to every neighbor. A very common variant is weighted KNN, where each point has a weight that is typically calculated from its distance. For example, under inverse distance weighting, each point has a weight equal to the inverse of its distance to the point being classified, so that nearer points have a higher vote than farther points. It is obvious that the accuracy may increase as you increase k, but the computation cost also increases.

5.5 Implementation of KNN

The algorithm for computing the k nearest neighbors is as follows (a sketch applying these steps to this project's DTW scores follows Section 5.6):

1. Determine the parameter K, the number of nearest neighbors, beforehand. This value is up to you.
2. Calculate the distance between the query instance and all the training samples. You can use any distance algorithm.
3. Sort the distances for all the training samples and determine the nearest neighbors based on the K minimum distances.
4. Since this is supervised learning, get the categories of the training data for the sorted values which fall under K.
5. Use the majority of the nearest neighbors as the prediction value.

5.6 Results

After implementing this algorithm, I completed the whole speech recognition software. I tried it on my own voice, and it worked with 100% accuracy. I then tried it with my roommate's and friends' voices, and it worked about 85% of the time. Therefore, the average success ratio of this algorithm is around 92%.
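As a sketch, the steps above reduce to a few lines when applied to the 25 DTW scores of this project (compare the tail of Recognition_Main in Appendix A); the scores here are random stand-ins.

N = 5; k = 3;                            % vocabulary size and neighbor count
labels = repmat(1:N, 1, 5);              % word class of each of the 25 stored templates
scores = rand(1, 25);                    % stand-in for AllScores from DTWCalc
[sorted, order] = sort(scores);          % ascending: smallest DTW distance first
votes = labels(order(1:k));              % classes of the k nearest templates
[word, freq] = mode(votes);              % majority class and its vote count
if k/freq > 2                            % no clear majority: reject, as in Appendix A
    disp('The word could not be properly recognised.');
end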
Chapter 6
HARDWARE IMPLEMENTATION

6.1 Overview

The first goal of this project was to create a circuit implementing a voice-controlled robot car that uses stand-alone hardware to perform the functions instead of the more commonly used software. The HM2007 IC and all the other components comprising this circuit were assembled and wired on a designed PCB. Table 2 below shows the parts list used in creating this circuit.

Components            Value              Quantity
HM 2007 SR-7 kit      N/A                1
AT89C52               N/A                1
ULN 2003              N/A                1
78LS05                N/A                2
Capacitors            0.1 uF             1
Capacitors            100 uF             1
Capacitors            0.22 uF            2
Resistors             220 ohms           2
Resistors             6.8 Kohms          1
Resistors             330 ohms           3
Oscillator (XTAL)     1.5 MHz            1
Relays                12 V               4
Motors                12 V, 200 RPM      2

Table 2: Component List

All of the above components were purchased from Images Scientific Instruments, Inc. The circuit of the SR-07 kit is shown in Figure 8 below.

[Figure 8: Circuit Schematic of SR-07 [14]]

The microphone and the keypad are the only user interfaces to the circuit. The microphone is a standard PC microphone, which acts as the transducer converting pressure waves to an electrical signal. The microphone is coupled to the HM2007 IC, which attempts to classify each word into one of the trained categories. The keypad consists of 12 normally open momentary contact switches, soldered onto a printed circuit board (PCB) used to communicate with the HM2007 IC. The keypad allows the user to train the system and clear the memory.

The circuit outputs consist of two 7-segment displays and an LED. The 7-segment displays show any error codes, show the target being trained, and show the final classification by the HM2007 system. As designed in the circuit, the top display is the most significant digit and the bottom is the least significant. For example, the number 9 would show a 0 on the top display and a 9 on the bottom display; only 01 through 09 were used for this project. The LED is connected to the HM2007 IC and shows its status: when the LED is on, the system is listening and will classify all incoming sounds; when the LED is off, the system has been placed in training mode; and when the LED flashes, it indicates that the word just spoken was successfully trained and placed into memory.

When the circuit is turned on, the HM2007 checks the static RAM. If everything checks out, the board displays "00" on the digital display and lights the red LED (READY). It is then in the "Ready" state, waiting for a command.

6.2 Block Diagram of Hardware

[Figure 9: Block Diagram of Hardware. Voice input (microphone) → HM 2007 SR-07 kit (speech processing logic) → microcontroller AT89C52 (generating relay outputs) → Darlington array ULN 2003 chip (increasing driving strength) → 4x4 relay network (driving motors) → Motor-1 and Motor-2.]

6.3 Training and Working of HM-2007

To train the circuit, begin by pressing on the keypad the number of the word you want to train. The circuit can be trained to recognize up to 40 words; use any number between 1 and 40. For example, press the number "1" to train word number 1. When you press the number(s) on the keypad, the red LED turns off and the number is shown on the digital display. Next press the "#" key to train. When the "#" key is pressed, it signals the chip to listen for a training word, and the red LED turns back on. Now speak the word you want the circuit to recognize clearly into the microphone. The LED should blink off momentarily; this is the signal that the word has been accepted. Continue training new words in the circuit using the procedure outlined above: press the "2" key then the "#" key to train the second word, and so on. The circuit will accept up to forty words, but you do not have to enter all 40 words into memory to use the circuit; you can use as many word spaces as you want.

The circuit is continually listening. Repeat a trained word into the microphone; the number of the word should appear on the digital display. For instance, if the word "directory" was trained as word number 25, saying the word "directory" into the microphone will cause the number 25 to be displayed.

The chip provides the following error codes:
a. 55 = word too long
b. 66 = word too short
c. 77 = no word match
6.4 Microcontroller Interfacing with Motors

The output of the HM-2007 kit goes to the microcontroller's input port (port P2 in the Appendix B code). The microcontroller decides which action to perform and generates the corresponding relay-drive outputs on another port (P1 in the code); on that port you find the microcontroller's output and the input for motor driving.

This output is not able to drive the motors directly. For that, we need a driving circuit that supplies enough current at 12 V DC to drive the 12 V, 200 rpm motors. Therefore, the microcontroller output drives a ULN 2003 IC. This IC is built up of seven Darlington arrays. A Darlington transistor (often called a Darlington pair) is a compound structure consisting of two bipolar transistors (either integrated or separate devices) connected in such a way that the current amplified by the first transistor is amplified further by the second one. The overall gain of the Darlington pair is

$$\beta_{darlington} = \beta_1 \cdot \beta_2$$

For example, two transistors each with a current gain of 100 give an overall gain of about 10,000.

[Figure 10: Simple Darlington Transistor]

This kind of Darlington transistor is helpful for driving the relay network, which controls the motors. Figure 11 shows the circuit diagram of the microcontroller interfacing circuit.

[Figure 11: Microcontroller Interfacing Circuit]

If you combine the above circuit with the HM2007 SR-07 kit, you can control two motors with your voice commands. In this particular project, I trained the SR-07 kit for five commands: Left, Right, Forward, Backward and Stop. The kit generates a BCD output which goes to the microcontroller, and the microcontroller decides what output to generate for the relays to drive the motors. The whole project was built on a small metal box and made into a small toy car. Figure 12 shows pictures of the project I made.

[Figure 12: Pictures of Whole Project]

6.5 Limitation of Hardware

According to the testing results, this hardware works only for the voice of the one person who trained the kit. In addition, it does not work in a noisy environment, and sometimes an echoed voice is also not recognized by the kit. Because of these errors, you cannot control the motors at the proper instant. We can reduce this limitation by developing a highly efficient algorithm such as the nearest neighbor algorithm: as discussed earlier in this project, the output of the MATLAB code could be applied to the microcontroller interfacing circuit, so the system might work more efficiently.

Chapter 7
CONCLUSION

This project shows that the KNN algorithm following DTW is an efficient algorithm for speech recognition. In addition, the Wiener filter can increase the signal-to-noise ratio of the applied signal, and this filtered signal enhances the efficiency of the speech recognition. Many major companies use this kind of technique in their customer service systems. The hardware of this project shows that plain DTW can be used for smaller tasks, and the project can be enhanced with a better filter and a larger algorithm database. Moreover, as a continuation of this project, the source code of the software design could be loaded into a higher-end microcontroller such as the ATMEGA-32, whose output could drive the microcontroller interface in a more efficient way.
APPENDIX A
Software Code

1) Recognition_Main:

clc;
nc = 13;            %Required number of mfcc coefficients
N = 5;              %Number of words in vocabulary
k = 3;              %Number of nearest neighbors to choose
fs = 16000;         %Sampling rate
duration1 = 0.15;   %Initial silence duration in seconds
duration2 = 2;      %Recording duration in seconds
G = 2;              %vary this factor to compensate for amplitude variations
NSpeakers = 5;      %Number of training speakers

fprintf('Press any key to start %g seconds of speech recording...', duration2);
pause;
silence = wavrecord(duration1*fs, fs);
fprintf('Recording speech...');
speechIn = wavrecord(duration2*fs, fs); % duration*fs is the total number of sample points
fprintf('Finished recording.\n');
fprintf('System is trying to recognize what you have spoken...\n');

speechIn1 = [silence;speechIn];          %pads with 150 ms silence
speechIn2 = speechIn1.*G;
speechIn3 = speechIn2 - mean(speechIn2); %DC offset elimination
speechIn = WienerFilter(speechIn3);      %Applies Wiener filtering for noise reduction
rMatrix = mfcc(nc,speechIn,fs);          %Compute test feature vector

Sco = DTWCalc(rMatrix,N);                %computes all DTW scores
[SortedScores,EIndex] = sort(Sco);       %Sort scores increasing
K_Vector = EIndex(1:k);                  %Gets k lowest scores
Neighbors = zeros(1,k);                  %will hold k-N neighbors

%Essentially, the code below uses the indices of the returned k lowest
%scores to determine their classes
for t = 1:k
    u = K_Vector(t);
    for r = 1:NSpeakers-1
        if u <= N
            break
        else
            u = u - N;
        end
    end
    Neighbors(t) = u;
end

%Apply k-Nearest Neighbor rule
Nbr = Neighbors;
%sortk = sort(Nbr);
[Modal,Freq] = mode(Nbr);                %most frequent value
Word = strvcat('Stop','Left','Right','Forward','Backward');
if mean(abs(speechIn)) < 0.01
    fprintf('No microphone connected or you have not said anything.\n');
elseif ((k/Freq) > 2)                    %if no majority
    fprintf('The word you have said could not be properly recognised.\n');
else
    fprintf('You have just said %s.\n',Word(Modal,:)); %Prints recognized word
end

2) Create Reference File:

clc;
nc = 13;            %Required number of mfcc coefficients
Ref1 = cell(1,5);
Ref2 = cell(1,5);
Ref3 = cell(1,5);
Ref4 = cell(1,5);
Ref5 = cell(1,5);

for j = 1:5
    q = ['\SpeechData\1\5_' num2str(j) '.wav'];
    [speechIn1,FS1] = wavread(q);
    %disp(FS1);
    speechIn1 = WienerFilter(speechIn1);
    Ref1(1,j) = {mfcc(nc,speechIn1,FS1)}; %MFCC coefficients are computed here
end

for k = 1:5
    q = ['\SpeechData\2\5_' num2str(k) '.wav'];
    [speechIn2,FS2] = wavread(q);
labels = {'Stop','Left','Right','Forward','Backward'}; s1 = cell2struct(Ref1, labels, 2); save Vectors1.mat -struct s1; s2 = cell2struct(Ref2, labels, 2); save Vectors2.mat -struct s2; s3 = cell2struct(Ref3, labels, 2); save Vectors3.mat -struct s3; s4 = cell2struct(Ref4, labels, 2); save Vectors4.mat -struct s4; s5 = cell2struct(Ref5, labels, 2); save Vectors5.mat -struct s5; 3) Function Wiener Filter: 46 function Speechout=WienerFilter(Speechin) d=Speechin; d=d*8; d=d'; %Enhanced Speech Signal Strength So we can generate the Noise signal of proper elements %fq=fft(d,8192); % Discreat Fourier Transform %t = 0:1:36062; %subplot(3,1,1); %f=Fs*(0:4095)/8192; %Fixed Frequency Spectrum %plot(f,(y(1:4096))); %plot(t,y); %title ('Original voice signal in frequency domain graphics'); %xlabel ('frequency F'); %ylabel ('FFT'); %[m,n]=size(d); %x_noise=randn(1,n); %x=d+x_noise; %After joining the noise of speech signals, noise is (0,1) distribution of Gaussian white noise x = d; %fq=fft(x,8192); %subplot(3,1,2); %plot(f,(x(1:4096))); %plot(t,x); %title ('Speech Signal with Noise in frequency domain graphics'); %xlabel ('frequency F'); %ylabel('FFT'); Rxxcorr=xcorr(x(1:4096)); size(Rxxcorr); A=Rxxcorr(4096:4595); %Rxx of the wienerfilter Rxdcorr=xcorr(d(1:4096),x(1:4096)); size(Rxdcorr); B=Rxdcorr(4096:4595); %Rxd of the wienerfilter M=500; %Length of the Filter Speechout=wienerfilter1(x,A,B,M); %Denoising using Wiener filtering %Result = Result/6; %Enhanced Result Signal %fq=fft(Result); %subplot(3,1,3); %f=Fs*(0:4095)/8192; %plot(f,(Result(1:4096))); %plot(t,Result); %title ('Wiener filtering voice signal in frequency domain'); %xlabel ('frequency F'); %ylabel('FFT'); end 47 function y=wienerfilter1(x,Rxx,Rxd,N) %For Wiener filtering %x is an input signal,Rxx is the input signal's autocorrelation vector %Rxx is an input signal and ideal signal cross-correlation of vector,n is the length of the Wiener filter %Output y is an input signal by Wiener filter for Wiener filtering output h=wienersolution(Rxx,Rxd,N);%solution of Wiener filter coefficient t=conv(x,h); %Filter Lh=length(h); %Get length of the filter Lx=length(x); %Get the length of the input signal y=t(double(uint16(Lh/2)):Lx+double(uint16(Lh/2))-1);%output sequence y and the length of the input sequence xlength of the same %y = t; end 4) Wiener_Solution: function h = wienersolution(A,B,M) %Solution of wiener-hopf equation %A (RXX) is autocorrelation for received signal vector as Rxx (0), Rxx (1), ... ..., Rxx (M-1) %B (Rxd) is to receive signals, and there is no noise interference signal cross correlation vector to Rxd (0), Rxd (1), ... 
4) Wiener_Solution:

function h = wienersolution(A,B,M)
%Solution of the Wiener-Hopf equation.
%A (Rxx) is the autocorrelation vector of the received signal: Rxx(0), Rxx(1), ..., Rxx(M-1).
%B (Rxd) is the cross-correlation vector between the received signal and
%the noise-free signal: Rxd(0), Rxd(1), ..., Rxd(M-1).
%M is the length of the filter.
%h holds the filter coefficients.
%For example A=[6,5,4,3,2,1]; B=[100,90,120,50,80,200]; M=6;
%Solution h=[26.4286 -20.0000 50.0000 -50.0000 -45.0000 81.4286]'
T1 = zeros(1,M);    %T1 stores an intermediate vector of the equation solution
T2 = zeros(1,M);    %T2 stores an intermediate vector of the equation solution
T1(1) = B(1)/A(1);
T2(1) = B(2)/A(1);
X = zeros(1,M);
for i = 2:M-1
    temp1 = 0;
    temp2 = 0;
    for j = 1:i-1
        temp1 = temp1 + A(i-j+1)*T1(j);
        temp2 = temp2 + A(i-j+1)*T2(j);
    end
    X(i) = (B(i)-temp1)/(A(1)-temp2);
    for j = 1:i-1
        X(j) = T1(j) - X(i)*T2(j);
    end
    for j = 1:i
        T1(j) = X(j);
    end
    temp1 = 0;
    temp2 = 0;
    for j = 1:i-1
        temp1 = temp1 + A(j+1)*T2(j);
        temp2 = temp2 + A(j+1)*T2(i-j);
    end
    X(1) = (A(i+1)-temp1)/(A(1)-temp2);
    for j = 2:i
        X(j) = T2(j-1) - X(1)*T2(i-j+1);
    end
    for j = 1:i
        T2(j) = X(j);
    end
end
temp1 = 0;
temp2 = 0;
for j = 1:M-1
    temp1 = temp1 + A(M-j+1)*T1(j);
    temp2 = temp2 + A(M-j+1)*T2(j);
end
X(M) = (B(M)-temp1)/(A(1)-temp2);
for j = 1:M-1
    X(j) = T1(j) - X(M)*T2(j);
end
h = X;
end

5) Mel-Frequency Cepstrum Coefficients:

function REF = mfcc(num,s,Fs)
n = 512;                     %Number of FFT points
Tf = 0.025;                  %Frame duration in seconds
N = Fs*Tf;                   %Number of samples per frame
fn = 24;                     %Number of mel filters
l = length(s);               %Total number of samples in speech
Ts = 0.01;                   %Frame step in seconds
FrameStep = Fs*Ts;           %Frame step in samples
a = 1;
b = [1, -0.97];              %a and b are high pass filter coefficients
noFrames = floor(l/FrameStep);   %Maximum number of frames in speech sample
REF = zeros(noFrames-2, num);    %Matrix to hold cepstral coefficients
lifter = 1:num;                  %Lifter vector index
lifter = 1 + floor((num)/2)*(sin(lifter*pi/num)); %raised sine lifter version
if mean(abs(s)) > 0.01
    s = s/max(s);            %Normalises to compensate for mic vol differences
end

%Segment the signal into overlapping frames and compute MFCC coefficients
for i = 1:noFrames-2
    frame = s((i-1)*FrameStep+1:(i-1)*FrameStep+N); %Holds individual frames
    Ce1 = sum(frame.^2);     %Frame energy
    Ce2 = max(Ce1,2e-22);    %floors the energy to 2e-22
    Ce = log(Ce2);
    framef = filter(b,a,frame);  %High pass pre-emphasis filter
    %v = fix(N);
    v = hamming(N);
    v = v';
    F = framef.*v;           %multiplies each frame with hamming window
    FFTo = fft(F,N);         %computes the fft
    FFTo = FFTo';
    melf = melbankm(fn,n,Fs);    %creates the 24-filter mel filter bank (melbankm is from the VOICEBOX toolbox)
    halfn = 1 + floor(n/2);
    spectr1 = log10(melf*abs(FFTo(1:halfn)).^2); %result is mel-scale filtered
    spectr = max(spectr1(:),1e-22);
    c = dct(spectr);         %obtains DCT, changes to cepstral domain
    c(1) = Ce;               %replaces first coefficient
    coeffs = c(1:num);       %retains first num coefficients
    ncoeffs = coeffs.*lifter';   %Multiplies coefficients by lifter value
    REF(i,:) = ncoeffs';     %assigns mfcc coeffs to successive rows
end

%Call the dcoeff function to compute derivatives of the MFCC coefficients;
%concatenate all together to yield a matrix with 3*num columns
d = (dcoeff(REF)).*0.6;      %Computes delta-mfcc
d1 = (dcoeff(d)).*0.4;       %as above for delta-delta-mfcc
REF = [REF,d,d1];            %concatenates all together

6) DTW Algorithm Calculation:

function AllScores = DTWCalc(rMatrix,N)
%Vectors to hold DTW scores
Scores1 = zeros(1,N);
Scores2 = zeros(1,N);
Scores3 = zeros(1,N);
Scores4 = zeros(1,N);
Scores5 = zeros(1,N);

%Load the reference templates from file
s1 = load('Vectors1.mat'); REFall1 = struct2cell(s1);
s2 = load('Vectors2.mat'); REFall2 = struct2cell(s2);
s3 = load('Vectors3.mat'); REFall3 = struct2cell(s3);
s4 = load('Vectors4.mat'); REFall4 = struct2cell(s4);
s5 = load('Vectors5.mat'); REFall5 = struct2cell(s5);

%Compute DTW scores for the test template against all reference templates
for i = 1:N
    REF1 = REFall1{i,1};
    Scores1(i) = DTW(REF1,rMatrix);
end
for j = 1:N
    REF2 = REFall2{j,1};
    Scores2(j) = DTW(REF2,rMatrix);
end
for m = 1:N
    REF3 = REFall3{m,1};
    Scores3(m) = DTW(REF3,rMatrix);
end
for p = 1:N
    REF4 = REFall4{p,1};
    Scores4(p) = DTW(REF4,rMatrix);
end
for q = 1:N
    REF5 = REFall5{q,1};
    Scores5(q) = DTW(REF5,rMatrix);
end
AllScores = [Scores1,Scores2,Scores3,Scores4,Scores5];

function [cost] = DTW(featureMatrix,RefMatrix)
F = featureMatrix;
R = RefMatrix;
[r1,c1] = size(F);              %test matrix dimensions
[r2,c2] = size(R);              %reference matrix dimensions
localDistance = zeros(r1,r2);   %Matrix to hold local distance values

%The local distance matrix is derived below
for n = 1:r1
    for m = 1:r2
        FR = F(n,:) - R(m,:);
        FR = FR.^2;
        localDistance(n,m) = sqrt(sum(FR));
    end
end

D = zeros(r1+1,r2+1);   %Matrix of zeros for the global distance matrix
D(1,:) = inf;           %Pads top with horizontal infinite values
D(:,1) = inf;           %Pads left with vertical infinite values
D(1,1) = 0;
D(2:(r1+1), 2:(r2+1)) = localDistance;

%This loop iterates through the distance matrix to obtain the global
%minimum distance
for i = 1:r1
    for j = 1:r2
        [dmin] = min([D(i,j), D(i,j+1), D(i+1,j)]);
        D(i+1,j+1) = D(i+1,j+1) + dmin;
    end
end
cost = D(r1+1,r2+1);    %returns overall global minimum score

7) DC Coefficient:

function Diff = dcoeff(X)
%Computes the time derivative (delta) of the cepstral coefficient matrix X.
%(The original listing capitalized the MATLAB keywords and built-ins; they
%are restored to lowercase here, and the function is named dcoeff to match
%the call in the mfcc function above.)
[Nr,Nc] = size(X);
K = 3;          %Number of frames to span (backward and forward span equal)
b = K:-1:-K;    %Vector of filter coefficients
%Pads the cepstral coefficient matrix by repeating first and last rows K times
Px = [repmat(X(1,:),K,1); X; repmat(X(end,:),K,1)];
Diff = filter(b, 1, Px, [], 1);  %Filter data vector along each column
Diff = Diff/sum(b.^2);           %Divide by the sum of squares of all span values
%Trim off upper and lower K rows to make input and output matrices equal
Diff = Diff(K + (1:Nr),:);

APPENDIX B
Hardware Code

1) Code for Microcontroller:

#include <reg51.h>
#define voice P2
sbit RELAY1 = P1^0;
sbit RELAY2 = P1^1;
sbit RELAY3 = P1^2;
sbit RELAY4 = P1^3;

void main()
{
    P1 = 0x00;
    while(1)
    {
        if(voice == 0xF1)       //Generating relay combination for Left
        {
            RELAY1 = 1; RELAY2 = 0; RELAY3 = 1; RELAY4 = 0;
        }
        else if(voice == 0xF2)  //Generating relay combination for Right
        {
            RELAY1 = 0; RELAY2 = 1; RELAY3 = 0; RELAY4 = 1;
        }
        else if(voice == 0xF3)  //Generating relay combination for Forward
        {
            RELAY1 = 1; RELAY2 = 0; RELAY3 = 0; RELAY4 = 1;
        }
        else if(voice == 0xF4)  //Generating relay combination for Backward
        {
            RELAY1 = 0; RELAY2 = 1; RELAY3 = 1; RELAY4 = 0;
        }
        else if(voice == 0xF5)
        {
            RELAY3 = 1;
        }
        else if(voice == 0xF6)
        {
            RELAY3 = 0;
        }
        else if(voice == 0xF7)
        {
            RELAY4 = 1;
        }
        else if(voice == 0xF8)
        {
            RELAY4 = 0;
        }
    }
}

BIBLIOGRAPHY

[1] Kerstin Dautenhahn (2005), "Social Intelligence and Interaction in Animal Robots", SSAISB Convention, UK, pp. 1-3
[2] Yi Hu and P. C. Loizou (2006), "Subjective Comparison of Speech Enhancement Algorithms", IEEE 2006 International Conference, USA, pp. 1-5
[3] Wiener, N. (1949), "Extrapolation, Interpolation and Smoothing of Stationary Time Series", Wiley, USA, pp. 10-15
[4] Kolmogorov, A. (1941), "Interpolation and Extrapolation of Stationary Random Sequences", Mathematics series, USA, pp. 3-14
[5] Barrett, J. and Moir, T. (1987), "A Unified Approach to Multivariable Discrete-Time Filtering Based on the Wiener Theory", Kybernetika 23, pp. 177-197
[6] Brown, R. and Hwang, P. (1996), "Introduction to Random Signals and Applied Kalman Filtering", 3rd ed., John Wiley & Sons, New York
[7] Z. Qi and T. Moir (2007), "An Adaptive Wiener Filter for an Automotive Application with Non-Stationary Noise", 2nd International Conference on Sensing Technology, pp. 300-305
[8] Haykin, S. (2002), "Adaptive Filter Theory", 4th ed., Prentice Hall, Englewood Cliffs
[9] Lawrie, J. B. and Abrahams, I. D. (2007), "A Brief Historical Perspective of the Wiener-Hopf Technique", Journal of Engineering Mathematics, Springer, pp. 351-356
[10] Wiener, N. (1956), "I Am a Mathematician", Doubleday & Co. Inc., Garden City, USA
[11] Jones, D. S. (1952), "A Simplifying Technique in the Solution of a Class of Diffraction Problems", Quart. J. Math. 3, pp. 189-196
[12] R. Plomp, L. C. Pols and J. P. van de Geer (1967), "Dimensional Analysis of Vowel Spectra", J. Acoustical Society of America, pp. 707-712
[13] Thomas, B. (2008), "K-Nearest Neighbors Algorithm", Southern Methodist University, Texas, pp. 1-5
[14] SR-06/SR-07 Speech Recognition Kit, Images SI Inc., Staten Island, NY, http://www.imagesco.com/speech/speech-recognition-index.html