VOICE RECOGNITION SYSTEM BASED ON AUDIO FINGERPRINTING

Mantej Singh Sahota
B.E., Punjab Technical University, India, 2007

PROJECT

Submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE in ELECTRICAL AND ELECTRONICS ENGINEERING

at

CALIFORNIA STATE UNIVERSITY, SACRAMENTO

Fall 2010

VOICE RECOGNITION SYSTEM BASED ON AUDIO FINGERPRINTING

A Project by Mantej Singh Sahota

Approved by:

__________________________________, Committee Chair
Jing Pang, Ph.D.

__________________________________, Second Reader
Preetham Kumar, Ph.D.

____________________________
Date

Student: Mantej Singh Sahota

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library, and that credit is to be awarded for the project.

__________________________, Graduate Coordinator _______________________
Preetham Kumar, Ph.D. Date

Department of Electrical and Electronics Engineering

Abstract of VOICE RECOGNITION SYSTEM BASED ON AUDIO FINGERPRINTING by Mantej Singh Sahota

Voice recognition is the ability of a machine or a program to receive and interpret dictation, or to understand spoken words. In recent years, interest has grown in audio-fingerprinting-based voice recognition systems, which enable automatic content-based identification by extracting audio signatures from the signal and matching them against the fingerprint of the word to be recognized. This project discusses the implementation of a hardware-based, real-time voice recognition system capable of storing the fingerprints of three words of the user's choice and later recognizing them. This is achieved by implementing band pass filters in assembly language with fixed-point arithmetic on the ATmega32 microcontroller. The outputs of the filters are squared and accumulated. This approach leaves enough time for each speech sample to be processed for its frequency spectrum before the next sample arrives. The spoken voice is analyzed by calculating the Euclidean distance between the saved and the current fingerprints of the words. This technique provides a high success rate in recognizing the word. The algorithm for the voice recognition system is written in assembly and the C programming language. The voice recognition system was successfully demonstrated on the ATmega16 + 232 kit with an ATmega32 microcontroller.

_______________________, Committee Chair Jing Pang, Ph.D.
_______________________ Date

ACKNOWLEDGEMENT

Special thanks to Dr. Jing Pang for guiding me throughout my project and encouraging me to implement new ideas. I would also like to thank Dr. Preetham Kumar, graduate coordinator of the Electrical and Electronics Engineering department at California State University, Sacramento, for proofreading my report and providing me with the best lab facilities and the latest equipment for testing.

TABLE OF CONTENTS

Acknowledgement
List of Tables
List of Figures

Chapter
1. INTRODUCTION
   1.1 Introduction to Voice Recognition
   1.2 Purpose of the Project
   1.3 Significance of the Project
   1.4 Report Organization
2. BASICS OF VOICE RECOGNITION AND AUDIO FINGERPRINTING
   2.1 Voice Recognition
   2.2 Methods of Voice Recognition
       2.2.1 Template Matching Approach
       2.2.2 Feature Analysis Approach
   2.3 Concept of Audio Fingerprinting
   2.4 Properties of an Audio Fingerprint Based Voice Recognition System
3. ATMEL MEGA32 OVERVIEW
   3.1 ATmega32 ADC
   3.2 EEPROM Data Memory
       3.2.1 EEPROM Read/Write Access Registers
       3.2.2 EEPROM Usage in the Voice Recognition System
   3.3 USART
   3.4 Timer/Counter Registers
4. PROJECT DETAILS
   4.1 Overview
   4.2 Design Implementation Details
       4.2.1 Microphone
       4.2.2 High Pass Filter
       4.2.3 Amplifier Stage
       4.2.4 ATmega16 + 232 Kit and ATmega32 Processor
       4.2.5 LED Circuitry
   4.3 Design Strategy
       4.3.1 Filter Design
       4.3.2 Fingerprint Generation
       4.3.3 Fingerprint Calculation
       4.3.4 Initial Threshold Calculation
       4.3.5 Software Flow
5. SIMULATIONS AND DESIGN IMPLEMENTATION ANALYSIS
   5.1 Speech Frequency Analysis
   5.2 Filter Design and Fingerprint Generation Analysis
       5.2.1 MATLAB Filter Design and Fingerprint Generation
       5.2.2 Actual Filter Design and Fingerprint Generation
   5.3 Fingerprint Comparison Analysis
6. CONCLUSION
Appendix: Source Code
References

LIST OF TABLES

1. Table 3.1: UBRR Settings for Commonly Used Frequencies
2. Table 5.1: Band Pass Filter 200-400 Hz, First 2nd Order Coefficients
3. Table 5.2: Band Pass Filter 200-400 Hz, Second 2nd Order Coefficients

LIST OF FIGURES

1. Figure 2.1: Voice Recognition Based on Template Matching
2. Figure 2.2: Feature Extraction Model for Voice Recognition
3. Figure 2.3: Clips and Frames Used in Feature Analysis
4. Figure 2.4: Spectrogram of the Word "One"
5. Figure 3.1: ATmega32 Block Diagram
6. Figure 3.2: ADC Auto Trigger Logic
7. Figure 4.1: Voice Recognition System Block Diagram
8. Figure 4.2: Microphone with Amplification Circuitry
9. Figure 4.3: Schematic of Microphone Amplification Circuit
10. Figure 4.4: ATmega32 Developer's Kit
11. Figure 4.5: LED Circuitry
12. Figure 4.6: Schematic of ATmega32 Connected with LED and Microphone Circuitry
13. Figure 4.7: Software Flow for the Voice Recognition System
14. Figure 5.1: Signal Variation of "Hello"
15. Figure 5.2: Signal Variation of "One"
16. Figure 5.3: Fingerprint Accumulation after Every 250 Samples for a Word
17. Figure 5.4: Fingerprints of the Words "Back" and "History"
18. Figure 5.5: Fingerprints of the Word "Hello" at Different Intervals of Time
19.
Figure 6.1: HyperTerminal Screen Shot

Chapter 1
INTRODUCTION

1.1 Introduction to Voice Recognition

The term "voice recognition" refers to a recognition system that can be trained to a particular speaker, as is the case for most desktop recognition software. Voice recognition is performed on the basis of the frequency content of a voice. To obtain the frequency content, several samples of the same sound are averaged during the training phase. These averaged samples are referred to as fingerprints. The frequency content of a sound can be compared to the stored, averaged fingerprints by treating them as vectors and calculating the distance between them. If the distance to a reference fingerprint is close enough, the sound is considered a match [1].

1.2 Purpose of the Project

The purpose of the project is to implement a voice recognition algorithm efficiently on the ATmega32 microcontroller. The system should be capable of storing the fingerprints of three words of the user's choice in the EEPROM of the ATmega32 microcontroller as a dictionary, and of later using the stored fingerprints to recognize a spoken word with a template matching (fingerprint matching) algorithm. Fingerprint matching is achieved by calculating the Euclidean distance between the fingerprint of each dictionary word and that of the spoken word; the dictionary word with the minimum distance is reported as the spoken word. The process comprises initial threshold (noise) calculation, filter implementation, fingerprint generation, fingerprint storage and, lastly, fingerprint comparison.

1.3 Significance of the Project

The project helps in understanding the implementation of a voice recognition algorithm on a hardware platform. From the hardware perspective, the project gives an introduction to the ATmega16 programmer kit and to an ATmega32 implementation of a real-time voice recognition system. The project also gives a basic idea of implementing an optimized filter design in assembly language. The project can be extended in the future for various applications such as voice-based security systems and voice-based car navigation systems. Algorithms for audio fingerprint generation, filter implementation and template matching using Euclidean distance were written in assembly and C, and were tested effectively with the CodeVisionAVR compiler on the ATmega32 microcontroller. All the results were also simulated in MATLAB.

1.4 Report Organization

Chapter two describes the basics of voice recognition and audio fingerprinting technology. It gives an overview of the methods for voice recognition, namely "template matching" and "feature analysis". Chapter three gives an overview of the ATmega32 microcontroller used for voice recognition. Chapter four gives the details of the implementation of the project from both the hardware and software perspectives. Chapter five presents the implementation results and the MATLAB simulation analysis for audio fingerprint generation and for the fingerprint matching algorithm using Euclidean distance. Chapter six provides the conclusions of the project, its challenges and limitations, and the future work associated with it. The references are provided in the References section, and the programs for simulation and hardware implementation are provided in the Appendix.
Chapter 2
BASICS OF VOICE RECOGNITION AND AUDIO FINGERPRINTING

2.1 Voice Recognition

Voice recognition is the technology by which the human voice is converted into electrical signals. These electrical signals are converted into coding patterns that already have predefined meanings assigned to them; the "meaning" is simply the database of fingerprints (frequency samples) of the words already stored in the memory of the voice recognition system. Voice recognition is also known as speech recognition. Voice recognition focuses mostly on the human voice, because humans most naturally use their voices for the greater part of communication [2]. Computer systems capable of voice recognition are designed with such precision that they produce a specific type of response on receiving a voice signal. Even the fingerprint of the same word spoken by the same person at different instants of time is never the same. Also, each human voice is different, and even the same words can have different meanings if spoken in different contexts. To overcome this, several approaches have been implemented over the years with different degrees of success [1].

2.2 Methods of Voice Recognition

There are various approaches to voice recognition, but most commonly they are divided into two categories: "template matching" and "feature analysis". Of the two, template matching is the simpler and, if implemented correctly, the more accurate approach, although it has some limitations. In the template matching approach, the first step is for the user to speak a word or a phrase into a microphone.

2.2.1 Template Matching Approach

The electrical signals from a microphone are analog, so they are converted into digital signals with the help of an analog-to-digital converter. This digitized output is stored in memory. To determine the meaning of this voice input, the computer attempts to match it against a sample, or template, that has a known meaning; this is analogous to commands sent from a keyboard. The voice recognition system already has a stored template and simply attempts to match the current template to the stored one. The template is also known as a fingerprint (frequency sample). No two humans have the same voice, and a voice recognition system cannot contain a template for each potential user [1]. This means that the system needs to be trained with a new user's voice input before that user can use the system for voice detection. During the training session, the user speaks a word several times into the microphone; each time, the program in the voice recognition system displays that word on the screen. Once the user is comfortable with the system and the words he or she speaks are displayed correctly, the training session ends. One limitation of this approach is that the system is limited to its vocabulary. This type of system is called "speaker dependent". A system with a good amount of memory can hold hundreds of words and even short phrases, and the recognition accuracy is almost 98 percent [2]. Figure 2.1 shows the template matching process.

Figure 2.1: Voice Recognition Based on Template Matching

2.2.2 Feature Analysis Approach

Another form of voice recognition system is based on the feature extraction approach. These systems are usually speaker independent.
Instead of finding an exact or near match between the actual voice template and the stored template, this method first processes the voice using Fast Fourier Transforms or Linear Predictive Coding. In the next step the system tries to find similarities between the expected inputs and the actual digitized voice inputs. With this approach the system finds the similarities that are present across a good range of speakers, and thus the system need not be trained by each user before being used for voice detection [2]. Figure 2.2 shows a feature extraction model for voice recognition.

Figure 2.2: Feature Extraction Model for Voice Recognition

There are many ways of characterizing an audio signal. Mostly, audio features are categorized into two domains: time-domain and frequency-domain features. This can be explained with an example in which an audio signal sampled at 22 kHz is divided into clips of one second each. Feature analysis is performed on each clip by calculating a feature vector for it. These features are calculated on the basis of frame-level features, which are computed over overlapping short intervals known as frames. Each frame contains 512 samples and is shifted by 128 samples from the previous frame. Figure 2.3 shows the relationship between the clips and the frames.

Figure 2.3: Clips and Frames Used in Feature Analysis

The difference between the template matching and feature extraction techniques is that the feature extraction technique can recognize speech spoken with different accents and with varying delivery speed, pitch and volume, abilities that the template matching technique completely lacks. The implementation of speaker-independent systems is a difficult task, and some of the greatest hurdles are the different accents and inflections used by speakers around the world. This is the reason that the accuracy of a speaker-independent system drops to about 90 percent, compared with the 98 percent accuracy of a speaker-dependent voice recognition system.

Voice recognition systems can also be differentiated by the kind of speech they can recognize: discrete words, connected words or continuous speech. The simplest of these to implement is a discrete word system, in which the user pauses between the words to be recognized; this gives the system enough time to process each spoken word. By contrast, continuous speech systems are the most difficult to implement, because in continuous speech the words run into each other without significant pauses, which gives the system very little time to process the spoken words.

2.3 Concept of Audio Fingerprinting

An audio fingerprint is a set of samples of the frequencies of speech or voice which summarizes the whole recording. In recent years audio fingerprinting technology has evolved and gained a lot of attention, as it allows audio to be recognized irrespective of its format. Audio fingerprinting technology is known by different names such as pattern matching, multimedia information retrieval and cryptography [3]. Audio fingerprinting works on the principle of extracting audio frequency samples (fingerprints) from an audio stream and comparing these samples with a database of pre-stored samples, thus leading to voice detection. This process is not as simple as it seems to be.
The biggest challenge in this case is producing a correct pattern matching result. The fingerprints of many words are closely matched, so it becomes challenging for the system to detect the correct match; a system that reports a result only on a perfect-match basis may well be wrong. One approach to getting an appropriate result would be to match the whole speech, but that is neither effective nor efficient [3]. Another approach could be a cyclic redundancy check, but this involves compression of the binary formats, and a single flipped bit here or there can give completely absurd results.

Human speech can be analyzed by looking at the intensities of the different frequencies in the voice with respect to time. Figure 2.4 shows the spectrogram of the word "one". The x axis represents time, the y axis represents frequency, and the dark spots show the intensity of the word. If the spectrum is observed closely, one can see that the energy of the word "one" lies between 0.6 and 0.9 seconds. The red areas in the spectrogram represent high energy levels, and the lower energy levels are represented by shades of yellow and orange.

Figure 2.4: Spectrogram of the Word "One"

2.4 Properties of an Audio Fingerprint Based Voice Recognition System

An audio fingerprint based voice recognition system should have the following properties [3, 4]:

Robustness - The fingerprint obtained from a noisy version of an audio signal should be similar to that of the original audio. This requires the system to be fast and precise.

Pair-wise independence - Ideally, the audio fingerprints of a word spoken by the same speaker at different instants of time should be the same. In practice this is not the case: every time a person speaks the same word, the intensity differs, and hence the fingerprints are not identical.

Quick database search - If a voice recognition system is designed for a practical application, it should be able to search quickly through a database containing a large number of fingerprints.

Versatility - The voice recognition system should be able to extract and detect the audio irrespective of the audio format.

Reliability - The method the voice recognition system uses to access a query is also important, as it may lead to copyright violations in the case of song recognition.

Fragility - The voice recognition system may be used for applications in which verifying the content integrity of the audio is important. The system should be capable of detecting changes in the audio relative to the original.

Chapter 3
ATMEL MEGA32 OVERVIEW

The ATmega32 is a general purpose, low power CMOS 8-bit microcontroller based on the AVR RISC architecture. The key features of this microcontroller include up to 16 MIPS throughput at 16 MHz, 32 Kbytes of flash program memory, 2 Kbytes of internal SRAM, 1 Kbyte of EEPROM, two 8-bit timers and one 16-bit timer, an 8-channel 10-bit A/D converter, a USART interface and a JTAG interface. Figure 3.1 shows the block diagram of the ATmega32 [5].

3.1 ATmega32 ADC

The analog-to-digital converter in the ATmega32 is a 10-bit successive approximation ADC. It is connected to an 8-channel multiplexer, an arrangement that allows eight single-ended voltage inputs constructed from the pins of port A. The ADC converts the analog input to a 10-bit digital output through successive approximation.
The analog input channel and the differential gain are selected by writing particular bits to the ADMUX register [5]. The ADMUX register is an 8-bit register. The ADC is enabled by setting the ADC enable bit, ADEN, in the ADCSRA register: if it is set to one the ADC is enabled, otherwise it is disabled. The 10-bit ADC result is presented in the ADC data registers, ADCH and ADCL. If the ADC is used in left adjust result mode and no more than 8-bit precision is required, it is sufficient to read the ADCH register only.

Figure 3.1: ATmega32 Block Diagram

A single conversion is started by writing a logic 1 to the ADC Start Conversion bit (ADSC). The bit stays high as long as the conversion is in progress. If the data channel is changed during a conversion, the ADC finishes the current conversion before moving to the next one. Conversions can also be started automatically: the automatic mode is initiated by setting the ADC Auto Trigger Enable bit, and the trigger source is selected with the ADC trigger select bits. If the trigger signal is still set when a conversion completes, a new conversion does not start; in this situation an interrupt is generated, which must be cleared before a new conversion attempt is made. Figure 3.2 shows the ADC auto trigger logic. Even if auto trigger mode is enabled, a single conversion can be started by writing a one to ADSC in ADCSRA; ADSC can also be used to determine the progress of a conversion.

Figure 3.2: ADC Auto Trigger Logic

An analog source applied to ADC0-ADC7 sees the pin capacitance and the input leakage of that particular pin, even when the pin is not selected as the input. The analog-to-digital converter is optimized for analog signals, such as a microphone output, with an output impedance of approximately 10 kΩ. If a source matching this configuration is used, the sampling time is negligible; if a source with higher impedance is used, the sampling time depends on the time the capacitor associated with the ADC takes to charge.

For the purposes of this project, the ADC is used to convert the analog output of the microphone to a digital value. The value written to the ADMUX register was 0b00100000. Bits 7 and 6 were zero, which means the internal Vref is turned off. Bit 5 is the ADC left adjust result bit; it was set to 1, turning left adjustment on. Bits 4:0 are the analog channel and gain selection bits, which select the pin from which the ADC gets its analog input; they were all set to zero to select the input at ADC0 [5].
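As a concrete illustration of this configuration, the following is a minimal C sketch of the ADC setup, written with the avr-gcc <avr/io.h> register names (the project itself was built with CodeVisionAVR, where ADCSRA appears as ADCSR). The values mirror those quoted above and in the peripheral initialization of section 4.3.5: ADMUX = 0b00100000 and ADCSR = 0b11000111. The function names are illustrative, not the project's own routines.

#include <avr/io.h>

void adc_init(void)
{
    /* AREF reference (internal Vref off), left-adjusted result, channel ADC0:
       ADMUX = 0b00100000 */
    ADMUX = (1 << ADLAR);

    /* Enable the ADC, start the first conversion, ADC prescaler = 128
       (16 MHz / 128 = 125 kHz ADC clock): ADCSRA = 0b11000111 */
    ADCSRA = (1 << ADEN) | (1 << ADSC)
           | (1 << ADPS2) | (1 << ADPS1) | (1 << ADPS0);
}

/* With ADLAR set and only 8-bit precision needed, reading ADCH is enough. */
unsigned char adc_read8(void)
{
    return ADCH;
}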
3.2 EEPROM Data Memory

The ATmega32 contains 1 KB of electrically erasable programmable read-only memory (EEPROM). It is a separate data space in the ATmega32 and can be written and erased 100,000 times.

3.2.1 EEPROM Read/Write Access Registers

The EEPROM registers are accessible in the I/O space. The EEPROM has varying access times, so if the user's code contains instructions that write to the EEPROM, some precautions must be taken: with a heavily filtered power supply, Vcc rises and falls slowly on power-up and power-down, and a specific order of instructions must be followed while writing to the EEPROM. The CPU is halted for two clock cycles when a write attempt is made to the EEPROM, and for four clock cycles when a read attempt is made, before the next instructions are executed.

The registers that correspond to the EEPROM are the EEPROM address register, the EEPROM data register and the EEPROM control register. The EEPROM address register is a 16-bit register in which bits 15:10 are reserved; bits 9:0 specify the address in the 1024-byte EEPROM space, and the data bytes are addressed linearly from 0 to 1023. The initial value of this register must be defined before it is used. The EEPROM data register is 8 bits wide and holds the data to be written to the EEPROM at the address given by the EEPROM address register. The EEPROM control register is an 8-bit register: bits 7:4 are reserved for future use; bit 3 is the EEPROM ready interrupt enable bit; bit 2 is the EEPROM master write enable bit, which determines whether setting the EEWE bit to 1 actually causes the EEPROM to be written; and bits 1 and 0 are the EEPROM write enable and read enable bits, which are set and cleared according to the read and write requirements.

3.2.2 EEPROM Usage in the Voice Recognition System

In the voice recognition project the EEPROM was used to store the fingerprints of three words, which corresponds to 75.2% usage of the memory. The EEPROM was erased through pin 7 of port C: whenever the input at that pin was high, the EEPROM was erased completely. Because of the limited EEPROM space, only three words were stored.

3.3 USART

The Universal Synchronous and Asynchronous serial Receiver and Transmitter (USART) is a serial communication port that is an integral part of the ATmega32. This port is used to print the microcontroller's messages on HyperTerminal, and it also acts as a serial console for transferring data between the ATmega32 and peripheral devices. It has different modes of operation, and the baud rate can be adjusted to the user's requirements by writing values to the UBRR register; these values are listed in the ATmega32 datasheet for different clock frequencies, and Table 3.1 shows some of them.

Table 3.1: UBRR Settings for Commonly Used Frequencies

The UBRR register pair (UBRRH and UBRRL) is 16 bits wide: bit 15 is a register select bit, bits 14:12 are reserved, and bits 11:0 specify the USART baud rate as per the table in the ATmega32 datasheet. For the voice recognition project the value written to UBRRL was 103, corresponding to a 16 MHz system clock and a baud rate of 9600; HyperTerminal was configured to the same settings. The USART was used to print information such as peripheral initialization, the word detected, and the Euclidean distance calculated between the stored fingerprint and the fingerprint of the spoken word. Transmission and reception were enabled by writing 0x18 to the UCSRB register.
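The following is a minimal C sketch of this USART setup, again using avr-gcc register names and with illustrative function names. UBRRL = 103 follows from UBRR = fosc / (16 x baud) - 1 = 16,000,000 / (16 x 9600) - 1 = 103.17, rounded down to 103; UCSRB = 0x18 is exactly (1 << RXEN) | (1 << TXEN), the value quoted above.

#include <avr/io.h>

void usart_init(void)
{
    UBRRH = 0;                          /* baud rate, high byte                   */
    UBRRL = 103;                        /* 9600 baud at 16 MHz                    */
    UCSRB = (1 << RXEN) | (1 << TXEN);  /* 0x18: enable receiver and transmitter  */
}

void usart_putc(char c)
{
    while (!(UCSRA & (1 << UDRE)))      /* wait until the transmit buffer is empty */
        ;
    UDR = c;
}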
3.4 Timer/Counter Registers

The ATmega32 provides prescalers for Timer/Counter0 and Timer/Counter1, which are used to set the prescale values. Timer/Counter0 and Timer/Counter1 share the same prescaler module, but the two timer/counters can have different prescaler settings. The internal clock source provides a 16 MHz clock frequency, and with appropriate timer settings the prescaled clock can have a frequency of fclk/8, fclk/64, fclk/256 or fclk/1024 [5]. The special function I/O register (SFIOR) is one of the registers associated with the timers. It is an 8-bit register whose bit 0 is the prescaler reset bit for Timer/Counter1 and Timer/Counter0: if this bit is set, the prescaler is reset, and since both timers share the same prescaler, resetting it affects both. The bit is always read as zero.

The registers associated with Timer1 are TCNT1, OCR1A/B and ICR1, all of which are 16-bit registers. Special procedures must be followed when accessing these registers; they are explained in detail in the ATmega32 datasheet. The timer/counter can be clocked internally through the prescaler or by an external clock source connected to pin T1. The double-buffered OCR1A/B registers are continuously compared with the timer/counter value; a match can be used to generate a PWM waveform, and it also sets the compare match flag, which can be used to generate an interrupt.

In the voice recognition project, different timer values were used at different times to obtain different clock frequencies. To sample the ADC input at approximately 4000 Hz, the register TCCR0 was set to 0b00001011 (clear-timer-on-compare mode with a clk/64 prescale) and the counter register TCNT0 started at 0. The output compare register OCR0 was set to 62, so that each time the count in TCNT0 reached the OCR0 value an ADC sample was taken, giving a sampling rate of about 4032 Hz, close to the target frequency of 4000 Hz.

Chapter 4
PROJECT DETAILS

4.1 Overview

The implementation of the project involved various fields of electrical engineering: analog design, digital design, digital signal processing and systems programming in C. Figure 4.1 shows the block diagram of the voice recognition system.

Figure 4.1: Voice Recognition System Block Diagram

4.2 Design Implementation Details

The hardware design for the voice recognition system was divided into the following five parts.

4.2.1 Microphone

The microphone used in the project is a two-pin microphone: one pin is connected to ground and the other is the output. Operating power is supplied to the microphone from the Vcc terminal of the ATmega16 + 232 board, and pull-up resistors have been used to avoid damaging the microphone. The microphone has a 9 kHz response, which makes it suitable for the voice recognition system; most of the first and second harmonics of human speech lie close to 2 kHz. The output of the microphone is around 1.93 mV. Figure 4.2 shows the microphone with its amplification circuitry.

Figure 4.2: Microphone with Amplification Circuitry

4.2.2 High Pass Filter

The output from the microphone is passed through a high pass filter. This analog RC filter consists of a 0.22 µF capacitor and a 2 kΩ resistor. Its cutoff frequency is near 160 Hz, which is fairly close to the lower limit of the human speech frequency range, so the filter also removes the unwanted 60 Hz noise added by other electronic components (mains power in America operates at 60 Hz). Another advantage of this RC filter is that it blocks the DC bias from reaching the op amp.

4.2.3 Amplifier Stage

The op amp used for the amplification stage is the LM358P. The LM358P has a slew rate of 0.3 V/µs, and its response to the input signal from the microphone was better than that of the other op amps tried. The gain (Av) is given by Av = feedback resistance / input resistance = 1 MΩ / 2 kΩ = 1,000,000 / 2,000 = 500. The same was verified with a voltmeter.
The output of the op amp comes out to be nearly 4 V, which is good enough for the words to be recognized by the ATmega32. Figure 4.3 shows the circuit diagram of the amplification stage.

Figure 4.3: Schematic of Microphone Amplification Circuit

4.2.4 ATmega16 + 232 Kit and ATmega32 Processor

The developer kit used for the project is the ATmega16 + 232 kit fitted with an ATmega32 processor; Figure 4.4 shows the kit. Its key features include a USART connector for printing messages on HyperTerminal, an ISP connector, and four I/O ports with a JTAG interface for boundary scan. The assembly and C code is compiled into a hex file and downloaded to the ATmega32 flash memory with AVRDUDE via the ISP connector on the kit. The board provides the 5 V supply to the LED and microphone amplification circuitry. The microphone signal from the amplification circuitry connects to pin PA0 of the ATmega32 microcontroller.

Figure 4.4: ATmega32 Developer's Kit

4.2.5 LED Circuitry

The LED circuitry consists of six LEDs: three green LEDs, which represent the words stored in the EEPROM, and three red LEDs. One of the red LEDs lights when the fingerprint of a spoken word matches one of the fingerprints stored in the EEPROM. A 2 kΩ current-limiting resistor is connected to each LED.

Figure 4.5: LED Circuitry

All the LEDs connect to port B of the ATmega32 microcontroller. Figure 4.5 shows the LED circuitry, and Figure 4.6 shows the full schematic of the board, the microcontroller, the amplification circuitry with the microphone, and the LED circuitry connected to it. The EEPROM is erased using a switch connected to pin 7 of port C.

Figure 4.6: Schematic of ATmega32 Connected with LED and Microphone Circuitry

4.3 Design Strategy

The sampling frequency required to sample the spoken word follows from the Nyquist theorem, which states that, to avoid aliasing, the sampling rate must be at least twice the highest frequency present in the information signal. A sampling frequency of 4000 Hz is adequate for this voice recognition system: although human voice frequencies extend from 0 to about 4 kHz, the dominant speech energy, including the second and third harmonics, lies close to 2 kHz.

4.3.1 Filter Design

For the speech analysis, eight digital IIR filters are used. These are 4th order Chebyshev type II band pass filters with a stopband attenuation of 40 dB. Chebyshev filters are preferred over other filters because they provide sharp transitions after the cutoff frequency, which is necessary for the speech analysis. The band pass filters used are:

1st BPF: 200 Hz to 400 Hz
2nd BPF: 400 Hz to 600 Hz
3rd BPF: 600 Hz to 800 Hz
4th BPF: 800 Hz to 1000 Hz
5th BPF: 1000 Hz to 1200 Hz
6th BPF: 1200 Hz to 1400 Hz
7th BPF: 1400 Hz to 1600 Hz
8th BPF: 1600 Hz to 1800 Hz

The band below 200 Hz was neglected because of its very high noise levels. To build each 4th order filter, two second order Chebyshev sections are cascaded using the "Direct Form II Transposed" implementation of the difference equations:

y1(n) = b11*x1(n) + b12*x1(n-1) + b13*x1(n-2) - a11*y1(n-1) - a12*y1(n-2)   (4.1)
y2(n) = b21*x2(n) + b22*x2(n-1) + b23*x2(n-2) - a21*y2(n-1) - a22*y2(n-2)   (4.2)
yout(n) = g * y2(n)                                                         (4.3)

In these equations the a's and b's are the filter coefficients, x1 is the filter input, and the input of the second section is the output of the first, x2(n) = y1(n). The values of the coefficients can be calculated using MATLAB, as described in chapter 5.
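As an illustration of equations (4.1)-(4.3), the following is a minimal C sketch of one cascaded band pass filter in the 8.8 fixed point convention of section 5.2.2 (all coefficients pre-multiplied by 256 and rounded). It evaluates the difference equations directly rather than in transposed form, which gives the same output. The integer coefficients are the 200-400 Hz values from Tables 5.1 and 5.2, on the assumption that the tabulated feedback values already carry the sign with which they are accumulated; the type and function names are illustrative, not the project's actual assembly routines, and saturation handling is omitted.

#include <stdint.h>

typedef struct {
    int16_t b0, b1, b2;   /* feedforward coefficients, scaled by 256  */
    int16_t f1, f2;       /* feedback terms (-a1, -a2), scaled by 256 */
    int16_t x1, x2;       /* previous two inputs                      */
    int16_t y1, y2;       /* previous two outputs                     */
} biquad_t;

static int16_t biquad_step(biquad_t *s, int16_t x)
{
    /* y(n) = (b0*x(n) + b1*x(n-1) + b2*x(n-2) + f1*y(n-1) + f2*y(n-2)) / 256 */
    int32_t acc = (int32_t)s->b0 * x
                + (int32_t)s->b1 * s->x1
                + (int32_t)s->b2 * s->x2
                + (int32_t)s->f1 * s->y1
                + (int32_t)s->f2 * s->y2;
    int16_t y = (int16_t)(acc >> 8);          /* undo the 256x coefficient scale */

    s->x2 = s->x1;  s->x1 = x;                /* shift the delay lines */
    s->y2 = s->y1;  s->y1 = y;
    return y;
}

/* 200-400 Hz filter: two cascaded 2nd order sections (Tables 5.1 and 5.2). */
static biquad_t sec1 = {   21,   -32,   21, 451, -248, 0, 0, 0, 0 };
static biquad_t sec2 = { 2225, -4285, 2225, 458, -248, 0, 0, 0, 0 };

int16_t bpf_200_400(int16_t x)
{
    return biquad_step(&sec2, biquad_step(&sec1, x));  /* yout = second(first(x)) */
}

Note that the overall gain g of equation (4.3) does not appear explicitly here: it can be folded into the feedforward coefficients of one section, which is apparently what the comparatively large feedforward values in Table 5.2 reflect.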
4.3.2 Fingerprint Generation

The first step is to calculate the threshold value above which the input is recognized as a word. The threshold is calculated by reading the ADC input into a temporary variable, adc_in, under timer 0 and summing the values 256 times; this is treated as part of the noise calculation. The values from adc_in are then passed through the eight 4th order band pass Chebyshev filters with 40 dB of stopband attenuation. 2000 samples, covering half a second, are taken per word; the reason is the limited 2 KB RAM of the ATmega32 controller, which is about the space one word requires at 2000 samples per half second. The output of each filter is multiplied by the gain of the respective filter and then squared, and these values are accumulated with the previous values of the filter outputs. Each completed fingerprint is stored in the 1 KB EEPROM of the ATmega32 microcontroller.

4.3.3 Fingerprint Calculation

Because of the space constraint of the ATmega32, which has just 1 KB of EEPROM, each word was encoded and only the accumulated sample information was stored in the EEPROM. To compare the fingerprints of the stored words with that of the spoken word, the Euclidean distance formula was used:

D = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2 )   (4.4)

In this equation p1, p2, p3, ..., pn are the fingerprint values of a word stored in the EEPROM, and q1, q2, q3, ..., qn are the fingerprint values of the spoken word to be recognized. To decide whether a stored word was the same as the spoken word, the Euclidean distance between them was calculated, and two words were considered the same if the calculated distance between them was the least among all the stored words. Implementing this formula requires calculating the squares of the differences, but because of the implementation of Tor's fixed point arithmetic the filter coefficients had already been multiplied by 256, and squaring would have produced numbers large enough to overflow [6]. Hence a "pseudo Euclidean distance" was used, removing the squares and the square root and reducing the distance to a sum of absolute differences (a short C sketch of this comparison is given at the end of this chapter):

D = |p1 - q1| + |p2 - q2| + ... + |pn - qn|   (4.5)

The basic algorithm of the code is to check the ADC input at a sampling rate of 4 kHz. If the value of the ADC is greater than the threshold value, it is interpreted as the beginning of a half-second word. The samples of the spoken word are passed through the eight band pass filters and converted into a fingerprint. The Euclidean distance calculation then finds the closest match among the fingerprints stored in the EEPROM, and the corresponding red LED is turned on.

4.3.4 Initial Threshold Calculation

As part of initialization, the ADC value is read 256 times under timer 0, and the average is calculated without any multiplication or division: accumulating 256 samples and keeping the high byte of the 16-bit sum amounts to dividing by 256. Three such values are obtained, with a delay of almost 16.4 ms between the sample accumulations, and the threshold is then taken as four times their median. The threshold detects the start of a spoken word and also prevents the voice recognition system from being too sensitive. In practice, this logic was implemented with only limited success.
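The following is a small C sketch of this initialization under the scheme just described: 256 eight-bit ADC samples are accumulated into a 16-bit sum whose high byte is the average, and the threshold is four times the median of three such averages. The helper name adc_read8() and the function names are illustrative assumptions, standing in for the project's own timer-driven sampling.

#include <stdint.h>

extern uint8_t adc_read8(void);       /* assumed: returns one 8-bit ADC sample */

static uint8_t noise_average(void)
{
    uint16_t sum = 0;
    uint16_t i;
    for (i = 0; i < 256; i++)         /* in the project, one sample per timer 0 tick */
        sum += adc_read8();
    return (uint8_t)(sum >> 8);       /* high byte of the sum = sum / 256 */
}

uint16_t threshold_init(void)
{
    /* Three averages taken roughly 16.4 ms apart; threshold = 4 x their median. */
    uint8_t a = noise_average();
    uint8_t b = noise_average();
    uint8_t c = noise_average();
    uint8_t median = (a > b) ? ((b > c) ? b : ((a > c) ? c : a))
                             : ((a > c) ? a : ((b > c) ? c : b));
    return (uint16_t)(4u * median);
}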
4.3.5 Software Flow

Figure 4.7 shows the software flow for the voice recognition system. The first step after the board is reset is the computation of the coefficients for the eight Chebyshev type II band pass filters covering 200 Hz to 1800 Hz. The peripheral initialization function, Peripheral_init(), then initializes all the ports, the timers and the USART of the ATmega32 (among other settings: ADMUX = 0b00100000, ADCSR = 0b11000111, TCCR0 = 0b00001011, OCR0 = 62, TCCR1A = 0b11110010, TCCR1B = 0b00010010, TIMSK = 0b00000010, UCSRB = 0x18 and UBRRL = 103). If pin 7 of port C is high (it connects to the erase switch), all the old voice fingerprints are removed from the EEPROM. The system monitors a flag in the EEPROM at address 0x00: if the flag is 0xAA, there are words stored in the EEPROM, and the corresponding words' green LEDs blink. A maximum of three words can be stored in the EEPROM. The system then runs an infinite loop (while(1), with the timer running in the background) to calculate the threshold noise. If the data from the ADC is greater than the threshold value, sampling of the input voice starts; during the sampling process, the filter values are calculated using assembly functions. Once the 2000 samples (half a second) are over, the filter outputs are stored in an array.

Figure 4.7: Software Flow for the Voice Recognition System

If there are words available in the EEPROM, the system calls the compare() function to compare the EEPROM samples with the current samples. If there are no words in the EEPROM, the samples are written to the EEPROM; this is called the training mode. All the green LEDs glow once all three words have been entered successfully. The compare() function calculates, using the pseudo Euclidean distance formula, the distance between the current sample and the samples of each word stored in the EEPROM, and the red LED corresponding to the word with the shortest distance blinks.
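As referenced in section 4.3.3, the following is a minimal C sketch of the pseudo Euclidean comparison of equation (4.5): the distance between two 128-point fingerprints is the sum of the absolute differences of their data points, and the stored word with the smallest distance wins. The array names and prototypes are illustrative; the project's compare() routine reads its saved fingerprints from the EEPROM (see the appendix).

#include <stdint.h>

#define WORDS        3
#define FING_POINTS  128

static uint32_t pseudo_distance(const int16_t *p, const int16_t *q)
{
    uint32_t d = 0;
    uint8_t i;
    for (i = 0; i < FING_POINTS; i++) {
        int32_t t = (int32_t)p[i] - (int32_t)q[i];
        d += (uint32_t)((t < 0) ? -t : t);     /* |p - q| term of equation (4.5) */
    }
    return d;
}

/* Returns the index (0..2) of the stored word nearest to the spoken one. */
uint8_t best_match(const int16_t saved[WORDS][FING_POINTS], const int16_t *current)
{
    uint8_t w, best = 0;
    uint32_t d, best_d = 0xFFFFFFFFul;
    for (w = 0; w < WORDS; w++) {
        d = pseudo_distance(saved[w], current);
        if (d < best_d) { best_d = d; best = w; }
    }
    return best;
}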
Chapter 5
SIMULATIONS AND DESIGN IMPLEMENTATION ANALYSIS

5.1 Speech Frequency Analysis

Human speech frequency content lies below 4000 Hz; hence, as per the Nyquist theorem, the minimum sampling rate should be 8000 Hz. The human voice was analyzed in MATLAB before the design was implemented. The Data Acquisition Toolbox in MATLAB was used to record six words, each for exactly two seconds, at a sampling rate of 8000 Hz. The data acquisition tool takes the analog input from the microphone connected to the computer and samples the speech at the sampling frequency requested by the user; the minimum sampling frequency it supports is 8000 Hz. Each two-second recording thus contained 8000 samples per second. The following MATLAB commands were used to record the two-second words:

duration = 2;                    % two second recording period
ai = analoginput('winsound');
addchannel(ai, 1);
sampleRate = get(ai, 'SampleRate')
get(ai, 'SamplesPerTrigger')
requiredSamples = floor(sampleRate * duration);
set(ai, 'SamplesPerTrigger', requiredSamples);
waitTime = duration * 1.1 + 0.5  % buffer for the response time of the hardware
start(ai)
tic
wait(ai, waitTime);
toc
[data, time] = getdata(ai);      % retrieving the recorded data

Figure 5.1: Signal Variation of "Hello"

Figures 5.1 and 5.2 show the signal variation of the words "Hello" and "One", respectively, over a period of two seconds. The y axis shows the signal strength in volts and the x axis shows the two-second time span.

Figure 5.2: Signal Variation of "One"

5.2 Filter Design and Fingerprint Generation Analysis

The analysis of the filter design and of fingerprint generation is explained in the following two subsections.

5.2.1 MATLAB Filter Design and Fingerprint Generation

The filter design was tested in MATLAB before it was implemented in hardware. Eight band pass filters ranging from 200 Hz to 1800 Hz were used. The coefficients of the filters were obtained using the following MATLAB command, where Freq1 and Freq2 are the band edges normalized to the Nyquist frequency:

[B, A] = cheby2(4, 40, [Freq1, Freq2])

After the calculation of the coefficients, the words, recorded at 8000 samples per second, were down-sampled to 4000 samples per second, since the sampling frequency of the system is 4000 Hz. The 4000 samples of each word were divided into 16 groups of 250 samples each, so that each group, passed through a filter, generates one of the 16 data points (fingerprint values) that define a particular word for that filter. The 16 groups were passed through the eight filters; the output of a filter for each sample was squared and added to the running total for its group of 250 samples, producing one fingerprint value per group. In this way each filter generated 16 fingerprint values for each word. The fingerprint accumulation process is shown in figure 5.3.

Figure 5.3: Fingerprint Accumulation after Every 250 Samples for a Word

The following MATLAB commands were used to generate the fingerprints:

l1 = length(x);
x1 = resample(x, 4000, l1);     % resampling to 4000 samples
y(:,1) = x1;
while i < 5                     % count for using only four words
    count = 1:1:250;
    u1 = y(count, i);           % dividing into groups of 250 (250 * 16 = 4000 samples)
    % ... segregating the 4000 samples into groups of 250 continues ...
    count = 3751:1:4000;
    u16 = y(count, i);
    output1 = filter(B1, A1, u1);   % passing each group through the filter
    % ... collecting the filter outputs for each group continues ...
    output16 = filter(B1, A1, u16);
    ot1 = output1 .* output1;       % squaring the outputs of the filter
    % ... squaring of the filter outputs continues ...
    ot16 = output16 .* output16;
    result1 = sum(ot1);             % adding all the squared results
    % ... accumulation of the results continues ...
    result16 = sum(ot16);
    % concatenating the 16 fingerprint values for this filter
    fingerprint_filter1(i,:) = [result1, result2, result3, result4, ...
        result5, result6, result7, result8, result9, result10, result11, ...
        result12, result13, result14, result15, result16];
    i = i + 1;                  % next word
end

The total number of fingerprint values generated by the eight band pass filters is 8 x 16 = 128. These fingerprints can be plotted on a graph and the difference between words can be judged. Figure 5.4 shows the fingerprints of the words "Back" and "History".
Figure 5.4: Fingerprints of the Words "Back" and "History"

5.2.2 Actual Filter Design and Fingerprint Generation

The actual filter design was based on Prof. Land's optimized implementation of the second order IIR filter on the ATmega32 microcontroller. Here each 4th order Chebyshev filter is designed by cascading two second order Chebyshev sections. The coefficients of the second order filter were obtained with the following MATLAB command:

[B, A] = cheby2(2, 40, [Freq1, Freq2]);

The coefficients of the two second order sections were then obtained with:

[sos1, g1] = tf2sos(B1, A1, 'up', 'inf');

The coefficients obtained from these commands are floating point values, while the actual design implementation is based on fixed point arithmetic. The coefficients were therefore converted to fixed point by multiplying each number by 256 and rounding it to the nearest integer, rather than by using a float-to-fix macro. The process is based on Tor's algorithm. Tables 5.1 and 5.2 show the coefficients of the first and second 2nd order sections, respectively, for the 200-400 Hz band pass filter.

Table 5.1: Band Pass Filter 200-400 Hz, First 2nd Order Coefficients

SOS Floating Point    SOS Fixed Point Coefficient      Rounded to Nearest
Coefficient           (multiplied by 256)              Integer
 1.7613                1.7613 * 256  =  450.8928         451
-0.9700               -0.9700 * 256  = -248.3200        -248
 0.0816                0.0816 * 256  =   20.8896          21
-0.1233               -0.1233 * 256  =  -31.5648         -32
 0.0816                0.0816 * 256  =   20.8896          21

Table 5.2: Band Pass Filter 200-400 Hz, Second 2nd Order Coefficients

SOS Floating Point    SOS Fixed Point Coefficient      Rounded to Nearest
Coefficient           (multiplied by 256)              Integer
  1.7903               1.7903 * 256  =   458.3168         458
 -0.9680              -0.9680 * 256  =  -247.8080        -248
  8.6923               8.6923 * 256  =  2225.2288        2225
-16.7363             -16.7363 * 256  = -4284.4928       -4285
  8.6923               8.6923 * 256  =  2225.2288        2225

In the hardware implementation, each data point was generated by accumulating the outputs of each filter over 125 samples instead of 250. The reason is the 2 KB limit of the ATmega32's random access memory: 2000 samples (half a second) already require almost 2 KB of storage when passed through the eight filters. The total number of data points created by all the filters remains 128, so 2000 samples over half a second were taken instead of the 4000 samples per second used in the MATLAB simulation.
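As a check on Tables 5.1 and 5.2, the following is a small, self-contained C sketch of the conversion just described: each floating point coefficient is scaled by 256 and rounded to the nearest integer. It is an illustration only; in the project the conversion was done by hand from the MATLAB output rather than by a float-to-fix macro on the target.

#include <stdio.h>

static int to_fixed_8_8(double c)
{
    double scaled = c * 256.0;
    /* round half away from zero, matching the tables */
    return (int)((scaled >= 0.0) ? (scaled + 0.5) : (scaled - 0.5));
}

int main(void)
{
    /* First 2nd order section of the 200-400 Hz filter (Table 5.1) */
    double coeffs[] = { 1.7613, -0.9700, 0.0816, -0.1233, 0.0816 };
    unsigned i;
    for (i = 0; i < sizeof coeffs / sizeof coeffs[0]; i++)
        printf("%+8.4f  ->  %+5d\n", coeffs[i], to_fixed_8_8(coeffs[i]));
    return 0;   /* prints 451, -248, 21, -32, 21 */
}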
5.3 Fingerprint Comparison Analysis

All the samples of a word pass through the eight filters, and each data point of the fingerprint is the accumulation of the squares of consecutive filter outputs (250 outputs per data point in the MATLAB simulation, 125 in hardware). Different words have different frequency spectra. The same word should ideally produce the same spectrum, but the simulations showed that even the same word spoken at different times does not have identical fingerprints. Figure 5.5 shows the fingerprints of the word "Hello" spoken at different times; the fingerprint varies from one utterance to the next.

Figure 5.5: Fingerprints of the Word "Hello" at Different Intervals of Time

The relevant information of a word is stored in its fingerprint. Once the fingerprints of the three words "Hello", "Uno" and "One" were stored in the EEPROM of the ATmega32, spoken words were compared against the stored words. For the comparison, a function was called that performed the pseudo Euclidean distance calculation between the fingerprint values stored in the EEPROM and the fingerprint of the spoken word. The function went through all three words in the EEPROM and picked the one with the smallest calculated distance. Another set of words tried for voice recognition was "back", "history" and "run". The following commands were used in the MATLAB simulations to calculate the distances between the words:

% Calculating the pseudo Euclidean distances
d1 = sum(abs(fingerprint(:,1) - fingerprint(:,2)));  % distance between back.wav and history.wav
d2 = sum(abs(fingerprint(:,2) - fingerprint(:,3)));  % distance between history.wav and hello22.wav
d3 = sum(abs(fingerprint(:,1) - fingerprint(:,3)));  % distance between back.wav and hello22.wav
d4 = sum(abs(fingerprint(:,3) - fingerprint(:,4)));  % distance between hello22.wav and hello33.wav

The simulation results showed that the distance between hello22 and hello33 was the minimum, and the same was verified in the hardware implementation of the Euclidean distance.

Chapter 6
CONCLUSION

The implementation of the hardware-based real-time voice recognition system was successful. The project was about an 80% success, in that it can recognize particular combinations of three words. The user can train the system at the beginning with three words of his or her choice, and the fingerprints (frequency samples) are stored in the EEPROM of the ATmega32 controller. Each fingerprint takes around 300 bytes of space, and due to the limited 1 KB EEPROM of the microcontroller it was only possible to store the fingerprints of just three words. The algorithm used for the voice recognition is effective and is able to recognize words like "Back", "History" and "Run"; another working combination of three words is "Hello", "One" and "Uno". The limitation of the system is that it can recognize only a few combinations of words. The Euclidean distance was calculated between the word to be recognized and the stored words, and the red LED corresponding to the minimum distance blinked. The calculated distances could also be seen on HyperTerminal, so the results were verified in practice; Figure 6.1 shows a screen shot from HyperTerminal. The voice recognition system can be extended for use in various applications such as voice-based security systems and voice-controlled car navigation systems.
Figure 6.1: HyperTerminal Screen Shot

APPENDIX
Source Code

////******************************************************************////
//// Function Name : void main(void)
//// Description   : This is the main function. It initializes all the
////                 peripherals, then starts reading samples from the ADC
////                 and compares them with the EEPROM data; if the new word
////                 is near a word stored in the EEPROM according to the
////                 Euclidean formula, it reports the nearest match.
////******************************************************************////
void main(void)
{
  Peripheral_init();              // Initialize all peripherals
  pin = PINC;
  PORTC = pin;
  if (PORTC.7 == 0)               // if the port pin is 0, erase the EEPROM
  {
    sflag = 0xff;                 // Flag in the EEPROM is erased
    ENWORD = 0;                   // Word count in the EEPROM is erased
    printf(" \n\t ******** Clearing of EEPROM Memory Done******** \n\r ");
  }
  if (sflag == 0xAA)              // Three valid words are available in the EEPROM
  {
    printf(" Voice Recognition Mode ");
    PORTB.0 = 0; PORTB.1 = 0; PORTB.2 = 0;
  }
  else  // fewer than three words in the EEPROM; show the count using the LEDs
  {
    printf(" Learning Mode ");
    if (ENWORD == 0)
    {
      PORTB.0 = 1; PORTB.1 = 1; PORTB.2 = 1;
    }
    else if (ENWORD == 1)
    {
      PORTB.0 = 0; PORTB.1 = 1; PORTB.2 = 1;
    }
    else if (ENWORD == 2)
    {
      PORTB.0 = 0; PORTB.1 = 0; PORTB.2 = 1;
    }
  }
  while (1)                       // infinite loop
  {
    if (latest_sample == 1)       // latest_sample is set to 1 from the timer ISR
    {
      In_word();                  // Check whether the ADC input is greater than the
                                  // threshold value (word detected); if so, flag2 = 1
      if (flag2 == 1)             // the beginning of a word was detected from the ADC
      {
        input = (((int)adc_in) - 103);  // Remove the default DC offset of 103 from the
                                        // ADC sample to prevent overflow of the input
                                        // data under the fixed point arithmetic
        f1 = iir2_1(input);
        f2 = iir2_2(input);
        f3 = iir2_3(input);
        f4 = iir2_4(input);
        f5 = iir2_5(input);
        f6 = iir2_6(input);
        f7 = iir2_7(input);
        f8 = iir2_8(input);
        s_count++;                // Increment the sample counter
        if (s_count == 125)       // Check whether 125 samples have passed this iteration
        {
          g_count++;              // Take 125 samples 16 times: 16 x 125 = 2000 samples
          s_count = 0;            // Reset the fingerprint sample counter
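          /* The assignments below pack the eight accumulated band energies of
             this 125-sample group into row i of Group_sample; once 16 groups
             are done, the 16 x 8 = 128 values are flattened filter by filter
             into the one-dimensional fingerprint array cur_fing[] used for
             storage and comparison. */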
          j = 0;
          Group_sample[i][j++] = f1;
          Group_sample[i][j++] = f2;
          Group_sample[i][j++] = f3;
          Group_sample[i][j++] = f4;
          Group_sample[i][j++] = f5;
          Group_sample[i][j++] = f6;
          Group_sample[i][j++] = f7;
          Group_sample[i][j++] = f8;
          f1 = 0; f2 = 0; f3 = 0; f4 = 0;   // clearing for the next inputs
          f5 = 0; f6 = 0; f7 = 0; f8 = 0;
          i++;
        }
        if (g_count == 16)
        {
          g_count = 0;
          s_count = 0;
          flag2 = 0;
          i = 0;
          j = 0;
          for (m = 0; m < 8; m++)
          {
            for (l = 0; l < 16; l++)
            {
              cur_fing[k] = Group_sample[l][m];  // Store the data in the one
              k++;                               // dimensional array
            }
          }
          k = 0;
          if (sflag == 0xAA)      // words are available in the EEPROM; compare with them
          {
            Compare();            // compare the current samples with the stored samples
          }
          else                    // write the samples to the EEPROM and increment the word count
          {
            if (ENWORD > 3)       // if the word count is garbage or 0xFF, reset it to zero
              ENWORD = 0;
            for (o = 0; o < 128; o++)
            {
              saved_fing[ENWORD][o] = cur_fing[o];  // Writing data to the EEPROM
            }
            if ((ENWORD + 1) == 3)  // Once all three words are written, set the flag
            {                       // in the EEPROM at location 0x00 to 0xAA
              sflag = 0xAA;
              ENWORD = ENWORD + 1;  // Increment the word count in the EEPROM
            }
            else
              ENWORD = ENWORD + 1;  // Increment the word count in the EEPROM
            printf(" Data Stored %d ", ENWORD);
            printf("\n\r");
            if (ENWORD == 0)        // Drive the green LEDs corresponding to the word count
            {
              PORTB.0 = 1; PORTB.1 = 1; PORTB.2 = 1;
            }
            else if (ENWORD == 1)
            {
              PORTB.0 = 1; PORTB.1 = 1; PORTB.2 = 0;  // 1st green LED blinking
            }
            else if (ENWORD == 2)
            {
              PORTB.0 = 1; PORTB.1 = 0; PORTB.2 = 0;  // 1st and 2nd green LEDs blinking
            }
            else if (ENWORD == 3)
            {
              PORTB.0 = 0; PORTB.1 = 0; PORTB.2 = 0;  // 1st, 2nd and 3rd green LEDs blinking
            }
          }
        }   // if (g_count == 16)
      }     // if (flag2 == 1)
      latest_sample = 0;          // Reinitialize to 0 for the next ADC conversion
    }       // latest_sample
  }         // while (1)
}           // main

REFERENCES

[1] Jim Baumann, "Voice Recognition", Human Interface Technology Laboratory, Fall 1993.
[2] Richard L. Klevans and Robert D. Rodman, "Voice Recognition", Artech House, 1997.
[3] Ton Kalker and Jaap Haitsma, "A Review of Algorithms for Audio Fingerprinting", Fall 2000.
[4] Yoshikazu Miyanaga, "Robust Speech Recognition and its ROBOT Implementation", Hokkaido University, 2009.
[5] "ATmega32 datasheet", Atmel. Available: http://www.atmel.com
[6] Zhu Liu and Yao Wang, "Audio Feature Extraction and Analysis for Scene Segmentation and Classification", Polytechnic University, 1999.
[7] J. Haitsma, T. Kalker and J. Oostveen, "An Efficient Database Search Strategy for Audio Fingerprinting", IEEE Workshop on Multimedia Signal Processing, 2002.