VOICE RECOGNITION SYSTEM BASED ON AUDIO FINGERPRINTING
Mantej Singh Sahota
B.E., Punjab Technical University, India, 2007
PROJECT
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF SCIENCE
in
ELECTRICAL AND ELECTRONICS ENGINEERING
at
CALIFORNIA STATE UNIVERSITY SACRAMENTO
Fall 2010
VOICE RECOGNITION SYSTEM BASED ON AUDIO FINGERPRINTING
A Project
by
Mantej Singh Sahota
Approved by:
__________________________________, Committee Chair
Jing Pang, Ph.D.
__________________________________, Second Reader
Preetham Kumar, Ph.D.
____________________________
Date
Student:
Mantej Singh Sahota
I certify that this student has met the requirements for format contained in the University
format manual, and that this project is suitable for shelving in the Library and credit is to
be awarded for the Project.
__________________________, Graduate Coordinator _______________________
Preetham Kumar, Ph.D.
Date
Department of Electrical and Electronics Engineering
Abstract
of
VOICE RECOGNITION SYSTEM BASED ON AUDIO FINGERPRINTING
by
Mantej Singh Sahota
Voice recognition is the ability of a machine or a program to receive and interpret dictation, or to understand spoken words. In recent years, interest has grown in audio-fingerprinting-based voice recognition systems, which enable automatic content-based identification by extracting audio signatures from the signal and matching them against the fingerprint of the word to be recognized.
This project discusses the implementation of a hardware-based, real-time voice recognition system capable of storing the fingerprints of three words of the user's choice and later recognizing them. This is achieved by implementing band pass filters in assembly language with fixed-point arithmetic on the AT Mega32 microcontroller. The outputs of the filters are squared and accumulated. This approach saves a great deal of time, allowing each speech sample to be processed for its frequency spectrum before the next sample comes in. The analysis of the voice is made by calculating the Euclidean distance between the saved and the current fingerprints of the words. This technique provides a high success rate in recognizing the word. The algorithm for the voice recognition system is written in assembly and the C programming language.
The voice recognition system was successfully demonstrated on the ATMEL Mega16 + 232 kit with the AT Mega32 microcontroller.
_______________________, Committee Chair
Jing Pang, Ph.D.
_______________________
Date
ACKNOWLEDGEMENT
Special thanks to Dr. Jing Pang for guiding me throughout my project and encouraging me to implement new ideas. I would also like to thank Dr. Preetham Kumar, graduate coordinator of the Electrical and Electronics Engineering department at California State University, Sacramento, for proofreading my report and providing me with the best lab facilities and the latest equipment for testing.
TABLE OF CONTENTS

Acknowledgement
List of Tables
List of Figures

Chapter
1. INTRODUCTION
   1.1 Introduction to Voice Recognition
   1.2 Purpose of the Project
   1.3 Significance of the Project
   1.4 Report Organization
2. BASICS OF VOICE RECOGNITION AND AUDIO FINGERPRINTING
   2.1 Voice Recognition
   2.2 Methods of Voice Recognition
       2.2.1 Template Matching Approach
       2.2.2 Feature Analysis Approach
   2.3 Concept of Audio Fingerprinting
   2.4 Properties of Audio Fingerprint Based Voice Recognition System
3. ATMEL MEGA32 OVERVIEW
   3.1 AT Mega32 ADC
   3.2 EEPROM Data Memory
       3.2.1 EEPROM Read/Write Access Registers
       3.2.2 EEPROM Usage in Voice Recognition System
   3.3 USART
   3.4 Timer/Counter Registers
4. PROJECT DETAILS
   4.1 Overview
   4.2 Design Implementation Details
       4.2.1 Microphone
       4.2.2 High Pass Filter
       4.2.3 Amplifier Stage
       4.2.4 ATMEL Mega16 + 232 and AT Mega32 Processor
       4.2.5 LED Circuitry
   4.3 Design Strategy
       4.3.1 Filter Design
       4.3.2 Fingerprint Generation
       4.3.3 Fingerprint Calculation
       4.3.4 Initial Threshold Calculation
       4.3.5 Software Flow
5. SIMULATIONS AND DESIGN IMPLEMENTATION ANALYSIS
   5.1 Speech Frequency Analysis
   5.2 Filter Design and Fingerprint Generation Analysis
       5.2.1 MATLAB Filter Design and Fingerprint Generation
       5.2.2 Actual Filter Design and Fingerprint Generation
   5.3 Fingerprint Comparison Analysis
6. CONCLUSION
Appendix: Source Code
References

LIST OF TABLES

1. Table 3.1: UBRR Settings for Commonly Used Frequencies
2. Table 5.1: Band Pass Filter 200-400 Hz First 2nd Order Coefficients
3. Table 5.2: Band Pass Filter 200-400 Hz Second 2nd Order Coefficients

LIST OF FIGURES

1. Figure 2.1: Voice Recognition Based on Template Matching
2. Figure 2.2: Feature Extraction Model for Voice Recognition
3. Figure 2.3: Clips and Frames Used in Feature Analysis
4. Figure 2.4: Spectrogram of the Word "one"
5. Figure 3.1: AT Mega32 Block Diagram
6. Figure 3.2: ADC Auto Trigger Logic
7. Figure 4.1: Voice Recognition System Block Diagram
8. Figure 4.2: Microphone with Amplification Circuitry
9. Figure 4.3: Schematic of Microphone Amplification Circuit
10. Figure 4.4: ATMEL Mega32 Developer's Kit
11. Figure 4.5: LED Circuitry
12. Figure 4.6: Schematic of AT Mega32 Connected with LED and Microphone Circuitry
13. Figure 4.7: Software Flow for Voice Recognition System
14. Figure 5.1: Signal Variation of "Hello"
15. Figure 5.2: Signal Variation of "One"
16. Figure 5.3: Fingerprint Accumulation after Every 250 Samples for a Word
17. Figure 5.4: Fingerprints of Words "Back" and "History"
18. Figure 5.5: Fingerprints of the Word "Hello" at Different Intervals of Time
19. Figure 6.1: HyperTerminal Screen Shot
Chapter 1
INTRODUCTION
1.1 Introduction to Voice Recognition
The term "voice recognition" refers to a recognition system that can be trained to a particular speaker, as is the case for most desktop recognition software. Voice recognition is performed on the basis of the frequency content of a voice. To obtain the frequency content, several samples of the same sound are averaged during the training phase. These averaged samples are referred to as fingerprints. The frequency content of a sound can be compared with the stored fingerprints by treating them as vectors and calculating the distance between them. If the distance to a reference fingerprint is close enough, it is considered a match [1].
1.2 Purpose of the Project
The purpose of the project is to implement a voice recognition algorithm efficiently on an AT Mega32 microcontroller. The system should be capable of storing the fingerprints of three words of the user's choice in the EEPROM of the AT Mega32 microcontroller as a dictionary, and later using the stored fingerprints to recognize a spoken word via the template matching (fingerprint matching) algorithm. Fingerprint matching is achieved by calculating the Euclidean distance between the fingerprints of each dictionary word and the spoken word; the dictionary word with the minimum distance is recognized as the spoken word. The process comprises initial threshold (noise) calculation, filter implementation, fingerprint generation and storage, and finally fingerprint comparison.
1.3 Significance of the Project
The project helps in understanding the implementation of a voice recognition algorithm on a hardware platform. From the hardware perspective, the project gives an introduction to the AT Mega16 programmer kit and the AT Mega32 microcontroller implementation of a real-time voice recognition system. The project gives a basic idea of implementing an optimized filter design in assembly language. The project can be extended in the future for various applications such as voice-based security systems and voice-based car navigation systems. Algorithms for audio fingerprint generation, filter implementation and template matching using Euclidean distance were tested effectively with the CodeVision AVR compiler and the AT Mega32 microcontroller, and were written in assembly and C. All the results were also simulated in MATLAB.
1.4 Report Organization
Chapter two describes the basics of voice recognition and audio fingerprinting technology. It gives an overview of the methods for voice recognition, namely "template matching" and "feature extraction".
Chapter three gives an overview of the AT Mega32 microcontroller used for voice recognition.
Chapter four gives the details of the implementation of the project from both the hardware and software perspectives.
Chapter five presents the implementation and MATLAB simulation analysis for audio fingerprint generation and the fingerprint matching algorithm using Euclidean distance.
Chapter six provides the conclusions of the project, its challenges and limitations, and the future work associated with it. The references are provided in the References section. The programs for simulation and hardware implementation are provided in the Appendix.
Chapter 2
BASICS OF VOICE RECOGNITION AND AUDIO FINGERPRINTING
2.1 Voice Recognition
Voice recognition is the technology by which the human voice is converted into electrical signals. These electrical signals are converted into coding patterns that already have predefined meanings assigned to them. The meaning comes from the database of fingerprints (frequency samples) of words already stored in the memory of the voice recognition system. Voice recognition is also known as speech recognition. Most of the focus of voice recognition is on the human voice, because humans most naturally use their voices for the greater part of their communication [2].
The computer systems capable of voice recognition are designed with such precision that they produce a specific type of response on receiving a voice signal. Even the fingerprint of the same word spoken by the same person at different instants of time is never the same. Also, each human voice is different, and even the same words can have different meanings if they are spoken in different contexts. To overcome this, several approaches have been implemented over the years with different degrees of success [1].
2.2 Methods of Voice Recognition
There are various approaches to voice recognition, but most commonly they are divided into two categories: "template matching" and "feature analysis". Of the two, template matching is the more accurate and simpler approach if implemented correctly, though it has some limitations. In the template matching approach, the first step is for the user to speak a word or a phrase into a microphone.
2.2.1 Template Matching Approach
The electrical signals from a microphone are analog, so they are converted into digital signals with the help of an analog-to-digital converter. This digitized output is stored in memory. To determine the meaning of this voice input, the computer attempts to match it with a sample, or template, that has a known meaning, much like the input commands sent by a keyboard. The voice recognition system already has a stored template and simply attempts to match the current template to the stored one. The template is also known as a fingerprint (frequency sample).
No two humans have the same voice, and a voice recognition system cannot contain a template for each potential user [1]. This means that the system needs to be trained with a new user's voice input before that user can use the system for voice detection. During the training session, the user speaks a word several times into the microphone. Each time the user speaks the word, the program in the voice recognition system displays that word on the screen. Once the user feels comfortable with the system and the words spoken by him/her are displayed correctly, the training session ends. One limitation of this approach is that the system is restricted to its trained vocabulary. This type of system is called "speaker dependent". A system with a good amount of memory can hold hundreds of words and even short phrases, and the recognition accuracy is almost 98 percent [2]. Figure 2.1 shows the template matching process.
Figure 2.1: Voice Recognition Based on Template Matching
2.2.2 Feature Analysis Approach
Another class of voice recognition systems is based on the feature extraction approach. These systems are usually speaker independent. Instead of finding an exact or near match between the actual voice template and the stored template, this method first processes the voice using the Fast Fourier Transform or Linear Predictive Coding. In the next step, the system tries to find the similarities between the expected inputs and the actual digital voice inputs. With this approach the system finds similarities that are present across a good range of speakers, and thus the system need not be trained by each user before being used for voice detection [2]. Figure 2.2 shows a feature extraction model for voice recognition.

Figure 2.2: Feature Extraction Model for Voice Recognition

There are many ways of characterizing an audio signal. Mostly, audio features are categorized into two domains: time-domain and frequency-domain features. This can be explained by an example in which an audio sample at 22 kHz is divided into clips of one second each. Feature analysis is done on each clip by calculating a feature vector for it. These features are calculated on the basis of frame-level features, which are computed over overlapping short intervals known as frames. Each frame contains 512 samples and is shifted by 128 samples from the previous frame. Figure 2.3 shows the relationship between the clips and the frames; a minimal sketch of this frame slicing follows the figure.
Figure 2.3: Clips and Frames used in Feature Analysis
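As an illustration of the clip/frame layout described above, the following C sketch walks one clip in 512-sample frames with a 128-sample hop. The buffer layout and the process_frame callback are assumptions made for illustration; they are not part of the project code.

    #include <stddef.h>

    #define FRAME_LEN 512   /* samples per frame */
    #define FRAME_HOP 128   /* shift between adjacent frames */

    /* Visit every frame of one clip; clip_len is the number of samples
       in the clip, and process_frame() stands in for whatever frame-level
       feature computation follows. */
    static void for_each_frame(const short *clip, size_t clip_len,
                               void (*process_frame)(const short *frame))
    {
        size_t start;
        for (start = 0; start + FRAME_LEN <= clip_len; start += FRAME_HOP)
            process_frame(&clip[start]);
    }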
The difference between the template matching and feature extraction techniques is that feature extraction can recognize speech spoken with different accents and with varying delivery speed, pitch and volume, capabilities that the template matching technique completely lacks. The implementation of speaker independent systems is a difficult task, with some of the greatest hurdles being the different accents and inflections used by speakers of different nationalities around the world. This is the reason that the accuracy of a speaker independent system drops to 90 percent, compared with the 98 percent accuracy of a speaker dependent voice recognition system.
Voice recognition systems can also be differentiated by the kind of speech they can recognize: discrete words, connected words or continuous speech. The simplest of these to implement is the discrete word system, for which the user has to pause between the words to be recognized, giving the system enough time to process each spoken word. On the contrary, continuous speech systems are the most difficult to implement, because continuous speech involves words running into each other without any significant pauses, which gives the system very little time to process the spoken words.
2.3 Concept of Audio Fingerprinting
An audio fingerprint is a set of samples of the frequencies of speech or voice that summarizes the whole recording. In recent years, audio fingerprinting technology has evolved and gained a great deal of attention, as it allows audio to be recognized irrespective of its format. Audio fingerprinting technology is known by different names, such as pattern matching, multimedia information retrieval and cryptography [3].
Audio fingerprinting works on the principle of extracting audio frequency samples (fingerprints) from an audio stream and comparing these samples with a database of pre-stored samples, leading to voice detection. This process is not as simple as it seems. The biggest challenge is giving the correct pattern matching result: the fingerprints of many words are closely matched, so it becomes challenging for the system to detect the correct match, and a system that reports results on a perfect-match basis may be wrong. One conceivable approach to getting the appropriate result is matching the whole speech, but that is neither effective nor efficient [3]. Another candidate approach is a cyclic redundancy check, but this involves compression of the binary formats, and a single flipped bit can give completely absurd results.
Human speech can be analyzed by looking at the intensities of different frequencies in the voice with respect to time. Figure 2.4 shows the spectrogram of the word "one". The x axis represents time, the y axis represents frequency, and the dark spots show the intensity of the word. If the spectrum is observed closely, one can see that the energy of the word "one" lies between 0.6 and 0.9 seconds. The red areas in the spectrogram represent high energy levels, and the lower energy levels are represented by shades of yellow and orange.
Figure 2.4: Spectrogram of the Word "one"
2.4 Properties of Audio Fingerprint Based Voice Recognition System
An audio fingerprint based voice recognition system should have the following properties [3, 4]:
• Robustness – The fingerprint obtained from a noisy version of the audio should be similar to that of the original audio. This requires the system to be very fast and precise.
• Pairwise independence – Ideally, the audio fingerprints of a word spoken by the same speaker at different instants of time should be the same. However, this is not the case: every time a person speaks the same word, the intensity differs, and hence the fingerprints are not identical.
• Quick database search – If a voice recognition system is designed for a practical application, it should be able to quickly search a database holding a large number of fingerprints.
• Versatility – The voice recognition system should be able to extract and detect the audio irrespective of the audio format.
• Reliability – The methods used by the voice recognition system to access a query are also important, as they may lead to copyright violations in the case of song recognition.
• Fragility – The voice recognition system may be used in applications where verifying the content integrity of the audio is important. The system should be capable of detecting changes in the audio as compared to the original.
Chapter 3
ATMEL MEGA32 OVERVIEW
The AT Mega32 is a general purpose, low power CMOS 8-bit microcontroller based on the AVR RISC architecture. The key features of this microcontroller include up to 16 MIPS throughput at 16 MHz, 32 Kbytes of flash program memory, 2 Kbytes of internal SRAM, 1 Kbyte of EEPROM, two 8-bit timers, an 8-channel 10-bit A/D converter, a USART interface and a JTAG interface. Figure 3.1 shows the block diagram of the AT Mega32 [5].
3.1 AT Mega32 ADC
The analog-to-digital converter in the AT Mega32 is based on 10-bit successive approximation. The ADC is connected to an 8-channel multiplexer, an arrangement that allows 8 single-ended voltage inputs constructed from the pins of port A. The analog inputs and the differential gain values can be selected by writing particular bits to the ADMUX register, which is an 8-bit register [5]. The ADC is enabled by setting the ADEN bit in the ADCSRA register: if it is set to one, the ADC is enabled; otherwise it is disabled. The 10-bit ADC result is presented in the ADC data registers, ADCH and ADCL. If the ADC is used in left adjust result mode and no more than 8 bits of precision are required, it is sufficient to read only the ADCH register.
Figure 3.1: AT Mega32 Block Diagram
A single conversion is started by writing a logic one to the ADC Start Conversion bit (ADSC). The bit stays high as long as the conversion is in progress and is cleared when the conversion completes. If the data channel is changed during a conversion, the ADC finishes the current conversion before moving to the next one. Conversions can also be started automatically: the automatic mode is initiated by setting the ADC Auto Trigger Enable bit, and the source providing the trigger can be selected with the ADC trigger select bits. If the trigger signal is still set when a conversion completes, a new conversion does not start; in this situation an interrupt is generated, which needs to be cleared before a new conversion attempt is made. Figure 3.2 shows the ADC auto trigger logic. Even when auto trigger mode is enabled, a single conversion can be started by writing ADSC in ADCSRA to one; ADSC can also be used to determine the progress of the conversion.
Figure 3.2: ADC Auto Trigger Logic
An analog input source applied to ADC0-ADC7 sees the pin capacitance and the input leakage of that particular pin, even when the pin is not selected as input. The analog-to-digital converter is optimized for analog signals, such as a microphone output, with an output impedance of approximately 10 kΩ. If a source matching this configuration is used, the sampling time is negligible; if a source with higher impedance is used, the sampling time depends on the time taken to charge the capacitor associated with the ADC.
For the purposes of this project, the ADC is used to convert the analog output of the microphone to a digital value. The value used for the ADMUX register was 0b00100000. Bits 7 and 6 were zero, which means the internal Vref is turned off. Bit 5 is the ADC left adjust result bit; it was set to one, so the result is left adjusted. Bits 4:0 are the analog channel and gain selection bits, which select the pin from which the ADC gets its analog input; all of these bits were set to zero to select the input at ADC0 [5].
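As a minimal sketch (avr-gcc style, not the project's exact source), the register setup described above could be written as follows; the ADCSRA prescaler choice is an illustrative assumption, since the report only fixes the ADMUX value:

    #include <avr/io.h>

    void adc_init(void)
    {
        ADMUX  = 0b00100000;                 /* REFS1:0 = 00, ADLAR = 1, channel ADC0 */
        ADCSRA = (1 << ADEN)                 /* enable the ADC */
               | (1 << ADPS2) | (1 << ADPS1) | (1 << ADPS0);  /* clock = F_CPU/128 (assumed) */
    }

    unsigned char adc_read8(void)
    {
        ADCSRA |= (1 << ADSC);               /* start a single conversion */
        while (ADCSRA & (1 << ADSC))         /* ADSC stays high while the conversion runs */
            ;
        return ADCH;                         /* left adjusted: the top 8 bits suffice here */
    }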
3.2 EEPROM Data Memory
The AT Mega32 contains 1 KB of Electrically Erasable Programmable Read Only Memory (EEPROM), organized as a separate data space. The EEPROM has an endurance of at least 100,000 write/erase cycles.
3.2.1 EEPROM Read/Write Access Registers
The EEPROM registers are accessible in the I/O space. The EEPROM has varying access times, so if the user's code contains instructions to write the EEPROM, some precautions must be taken: with a heavily filtered power supply, Vcc rises and falls slowly on power-up and power-down, and a specific order of instructions must be followed while writing to the EEPROM. The CPU is halted for two clock cycles when a write attempt is made to the EEPROM; when a read attempt is made, the CPU is halted for four clock cycles before the next instructions are executed. The registers that correspond to the EEPROM are the EEPROM address register, the EEPROM data register and the EEPROM control register.
The EEPROM address register is 16 bits wide, with bits 15:10 reserved. Bits 9:0 specify the address in the 1024-byte EEPROM space; the data bytes can be addressed linearly from 0 to 1023. The initial value of this register must be defined before it can be used. The EEPROM data register is 8 bits wide and contains the data to be written to the EEPROM at the address given by the EEPROM address register. The EEPROM control register is an 8-bit register. Bits 7:4 are reserved for future use. Bit 3 is the EEPROM ready interrupt enable bit. Bit 2 is the EEPROM master write enable bit, which controls whether setting the EEWE bit to one causes the EEPROM to be written. Bits 1 and 0 are the EEPROM write enable and read enable bits, which are set and cleared according to the read and write requests to the EEPROM.
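A minimal polled read/write sketch following this register sequence (the ATmega32 datasheet procedure, written for avr-gcc) might look like this; in production code, interrupts should be disabled around the write sequence:

    #include <avr/io.h>

    void ee_write(unsigned int addr, unsigned char data)
    {
        while (EECR & (1 << EEWE))    /* wait for any previous write to finish */
            ;
        EEAR = addr;                  /* address in the 0..1023 range */
        EEDR = data;                  /* byte to be stored */
        EECR |= (1 << EEMWE);         /* master write enable first ... */
        EECR |= (1 << EEWE);          /* ... then write enable within four cycles */
    }

    unsigned char ee_read(unsigned int addr)
    {
        while (EECR & (1 << EEWE))    /* do not read during an ongoing write */
            ;
        EEAR = addr;
        EECR |= (1 << EERE);          /* trigger the read */
        return EEDR;
    }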
3.2.2 EEPROM Usage in Voice Recognition System
In the voice recognition project, the EEPROM was used to store the fingerprints of three words, which corresponds to 75.2% usage of the memory. The EEPROM was erased using pin 7 of port C: whenever the input at that pin was high, the EEPROM was erased completely. Due to the limited space of the EEPROM, only three words were stored.
3.3 USART
The Universal Synchronous and Asynchronous serial Receiver and Transmitter (USART) is a serial communication port that is an integral part of the AT Mega32. This port is used to print the microcontroller's messages on HyperTerminal, and it also acts as a serial console to transfer data between the AT Mega32 and peripheral devices. It has different modes of operation, and the baud rate can be adjusted according to the user's requirements by writing values to the UBRRL register; these values are listed in the AT Mega32 datasheet for different clock frequencies. Table 3.1 shows some of the UBRR values as per the clock frequency used. The UBRR register is 16 bits wide (UBRRH and UBRRL): bit 15 is the register select bit, bits 14:12 are reserved, and bits 11:0 specify the USART baud rate value as per the table given in the AT Mega32 datasheet.
For the voice recognition project, the value of the UBRRL register was 103, which corresponds to a 16 MHz system clock and a baud rate of 9600; HyperTerminal was adjusted to the same value. The USART was used to print information such as peripheral initialization, the word detected, and the Euclidean distance calculated between the stored fingerprint and the fingerprint of the spoken word. USART transmission and reception were enabled by writing the value 0x18 to the UCSRB register.
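A minimal initialization and transmit sketch matching these values (avr-gcc style; the project itself routes printf output through this port) could be:

    #include <avr/io.h>

    void usart_init(void)
    {
        UBRRH = 0;
        UBRRL = 103;                      /* 16 MHz / (16 * 9600) - 1 = 103 -> 9600 baud */
        UCSRB = 0x18;                     /* (1 << RXEN) | (1 << TXEN): enable RX and TX */
    }

    void usart_putc(char c)
    {
        while (!(UCSRA & (1 << UDRE)))    /* wait until the transmit buffer is empty */
            ;
        UDR = c;                          /* send one byte */
    }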
Table 3.1: UBRR Settings for Commonly Used Frequencies
3.4 Timer/Counter Registers
Two timer/counters are available in the AT Mega32, namely Timer/Counter0 and Timer/Counter1, together with their prescalers, which are used to set the prescale values. Timer/Counter0 and Timer/Counter1 share the same prescaler module, but they can have different prescaler settings. The internal clock source provides a 16 MHz clock frequency; by using appropriate prescaler values, the prescaled clock can have a frequency of fclk/8, fclk/64, fclk/256 or fclk/1024 [5]. The special function I/O register is one of the registers associated with the timers. It is an 8-bit register whose bit 0 is the prescaler reset bit for Timer/Counter1 and Timer/Counter0. If this bit is set, the prescaler is reset; as both timers share the same prescaler, the reset affects both timers. The bit is cleared by hardware, so it is always read as zero.
The registers associated with Timer1 are TCNT1, OCR1A/B and ICR1, all of which are 16-bit registers. Some special procedures need to be followed while accessing these registers; they are explained in detail in the AT Mega32 datasheet. The Timer/Counter can be clocked internally through the prescaler or by an external clock source connected to pin T1. The double-buffered OCR1A/B registers are continuously compared with the Timer/Counter value; the result of a match can be used to generate a PWM waveform, and it also sets the compare match flag, which can be used to generate an interrupt.
In the voice recognition project, different timer values were used at different times to obtain different clock frequencies. To sample the ADC input at approximately 4000 Hz, the value of the TCCR0 register was 0b00001011 and the counter register TCNT0 was 0. The output compare register (OCR0) was set to 62: each time the counter value in TCNT0 incremented up to the compare value, the ADC input was sampled, giving an effective rate of 4032 Hz, close to the target sampling frequency of 4000 Hz.
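As a sketch of this configuration (avr-gcc style; the project's own code is in the appendix):

    #include <avr/io.h>
    #include <avr/interrupt.h>

    volatile unsigned char latest_sample;   /* polled by the main loop */

    void timer0_init(void)
    {
        TCCR0 = 0b00001011;     /* WGM01 = 1 (CTC mode), CS02:0 = 011 (clk/64) */
        TCNT0 = 0;
        OCR0  = 62;             /* compare value for a ~4 kHz tick (4032 Hz per the report) */
        TIMSK |= (1 << OCIE0);  /* enable the output compare interrupt */
        sei();
    }

    ISR(TIMER0_COMP_vect)       /* fires at the sampling rate */
    {
        latest_sample = 1;      /* tell the main loop to read the ADC */
    }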
Chapter 4
PROJECT DETAILS
4.1 Overview
The implementation of the project drew on various fields of electrical engineering, including analog design, digital design, digital signal processing and systems programming in C. Figure 4.1 shows the block diagram of the voice recognition system.
Figure 4.1: Voice Recognition System Block Diagram
4.2 Design Implementation Details
The hardware design for the voice recognition system was divided into the following five parts:
4.2.1 Microphone
The microphone used in the project is a two-pin microphone: one pin is connected to ground and the other provides the output. Operating power is supplied to the microphone from the Vcc terminal of the ATMEGA16 + 232 board, and a pull-up resistor is used to avoid damaging the microphone. The microphone has a frequency response of up to 9 kHz, which makes it suitable for the voice recognition system, as most of the first and second harmonics of human speech lie close to 2 kHz. The output of the microphone is around 1.93 mV. Figure 4.2 shows the microphone with its amplification circuitry.
Figure 4.2: Microphone with Amplification Circuitry
4.2.2 High Pass Filter
The output from the microphone is passed through a high pass filter, an analog RC filter built from a 0.22 µF capacitor and a 2 kΩ resistor. The cutoff frequency of the filter is nearly 160 Hz, fairly close to the lower limit of the human speech frequency range. Since mains power in America operates at 60 Hz, the other electronic components add 60 Hz noise, and this filter also cuts off that unwanted 60 Hz noise. Another advantage of the RC filter is that it prevents the DC bias from reaching the op amp.
4.2.3 Amplifier Stage
The op amp used for the amplification stage is the LM358P. The LM358P has a good slew rate of 0.3 V/µs, and its response to the input signal from the microphone is better than that of the other op amps considered. The gain was calculated as Va = feedback resistance / input resistance = 1 MΩ / 2 kΩ = 1,000,000 / 2,000 = 500, and the same was verified with a voltmeter. The output of the op amp comes out to nearly 4 V, which is good enough for the words to be recognized by the ATMEL Mega32. Figure 4.3 shows the circuit diagram of the amplification stage.
Figure 4.3: Schematic of Microphone Amplification Circuit
4.2.4 ATMEL Mega16 + 232 and AT Mega 32 Processor
The developer kit used for the project is the ATMEL Mega16 + 232 with the ATMEL Mega32 processor; Figure 4.4 shows the kit. Its key features include a USART connector for printing messages on HyperTerminal, an ISP connector, and four I/O ports with a JTAG interface for boundary scan. The assembly and C code is compiled into a hex file and downloaded to the AT Mega32 flash memory using the AVRDUDE programming utility via the ISP connector present on the kit. The board is responsible for providing 5 V power to the LED and microphone amplification circuitry. The microphone signal from the amplification circuitry connects to pin PA0 of the ATMEL Mega32 microcontroller.
Figure 4.4: ATMEL Mega32 Developer’s Kit
4.2.5 LED Circuitry
The LED circuitry consists of six LEDs. Three of the LEDs are green and represent the words stored in the EEPROM, and three are red; one of the red LEDs glows when a spoken word's fingerprint matches one of the fingerprints stored in the EEPROM. A 2 kΩ current-limiting resistor is connected to each LED.
Figure 4.5: LED Circuitry
All the LEDs connect to port B of the AT Mega32 microcontroller. Figure 4.5 shows the LED circuitry.
Figure 4.6 shows the full schematic of the board, the microcontroller, the amplification circuitry with the microphone, and the LED circuitry connected to it. The EEPROM is erased using a switch connected to pin 7 of port C.
Figure 4.6: Schematic of AT Mega32 Connected with LED and Microphone Circuitry
4.3 Design Strategy
The sampling frequency required to sample the spoken word is calculated from the Nyquist theorem, which states that, to avoid aliasing, the sampling rate must be at least twice the highest frequency within the information signal. A sampling frequency of 4000 Hz is suitable for this voice recognition system: although the human voice spans roughly 0 to 4 kHz, the second and third harmonics of human speech, which carry the information used here, are close to 2 kHz.
4.3.1 Filter Design
For the speech analysis, eight digital IIR filters are used. These eight 4th order Chebyshev band pass filters have a stop band ripple of 40 dB. Chebyshev filters are preferred over other filters because they provide the sharp transitions after the cutoff frequency that are necessary for speech analysis. The band pass filters used are:
1st BPF - 200 Hz to 400 Hz
2nd BPF - 400 Hz to 600 Hz
3rd BPF - 600 Hz to 800 Hz
4th BPF - 800 Hz to 1000 Hz
5th BPF - 1000 Hz to 1200 Hz
6th BPF - 1200 Hz to 1400 Hz
7th BPF - 1400 Hz to 1600 Hz
8th BPF - 1600 Hz to 1800 Hz
The band below 200 Hz was not used because of its very high noise levels. To build each 4th order filter, two second order Chebyshev sections are cascaded using the "Direct Form II Transposed" implementation of the difference equations:

y1(n) = b11*x1(n) + b12*x1(n-1) + b13*x1(n-2) - a11*y1(n-1) - a12*y1(n-2)    (4.1)
y2(n) = b21*x2(n) + b22*x2(n-1) + b23*x2(n-2) - a21*y2(n-1) - a22*y2(n-2)    (4.2)
yout = g * y2(n)                                                             (4.3)

where x2(n) = y1(n), since the output of the first section feeds the second. In the equations, a and b are the filter coefficients and g is the overall gain. The values of these coefficients can be calculated using MATLAB, as described in chapter 5.
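The following C sketch shows one Direct Form II Transposed section with coefficients pre-scaled by 256, in the spirit of the fixed-point scheme described in section 5.2.2. The struct and function names are illustrative; this is not the project's assembly implementation:

    typedef struct {
        int  b0, b1, b2, a1, a2;   /* coefficients multiplied by 256 */
        long z1, z2;               /* delay elements, kept in the x256 domain */
    } sos_t;

    static int sos_step(sos_t *s, int x)
    {
        int y = (int)(((long)s->b0 * x + s->z1) >> 8);   /* section output */
        s->z1 = (long)s->b1 * x - (long)s->a1 * y + s->z2;
        s->z2 = (long)s->b2 * x - (long)s->a2 * y;
        return y;
    }

    /* One 4th order band pass filter is two cascaded sections; the
       caller applies equation (4.3)'s gain g to the final output. */
    static int bpf_step(sos_t *s1, sos_t *s2, int x)
    {
        return sos_step(s2, sos_step(s1, x));
    }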
4.3.2 Fingerprint Generation
The first step was to calculate the threshold value above which the input is recognized as a word. The threshold was calculated by reading the ADC input into a temporary variable adc_in using timer 0 and summing the values 256 times, as part of estimating the noise level. The value from adc_in was then passed through the eight 4th order band pass Chebyshev filters with a stop band ripple of 40 dB. 2000 samples (half a second at 4000 Hz) were taken per word; the reason is the limited 2 KB RAM of the ATMEL Mega32, which is all the space required for a word when taking 2000 samples per half second.
The outputs of all the filters were multiplied by the gain of the respective filter and then squared, and these values were accumulated with the previous values of the filter outputs. Each fingerprint obtained was stored in the 1 KB EEPROM of the AT Mega32 microcontroller.
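Condensed into C, the accumulation step looks roughly as follows (a sketch; the project code in the appendix uses the same idea with variables f1..f8 and Group_sample):

    long fingerprint[8][16];          /* [filter][group] = 128 values per word */

    void accumulate(int group, const int y[8])
    {
        int f;
        for (f = 0; f < 8; f++)       /* square each filter output and */
            fingerprint[f][group] += (long)y[f] * y[f];   /* add it to its bin */
    }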
4.3.3 Fingerprint Calculation
Because of the space constraint in the ATMEL Mega32, which has just 1 KB of EEPROM, each word was encoded and its sampled information stored in the EEPROM. To compare the fingerprints of the stored words and the spoken word, the Euclidean distance formula was used:

D = sqrt((p1-q1)^2 + (p2-q2)^2 + ... + (pn-qn)^2)    (4.4)

In the above equation, p1, p2, p3, ..., pn are the stored fingerprint values of a word in the EEPROM and q1, q2, q3, ..., qn are the fingerprint values of the spoken word to be recognized. To decide whether a stored word was the same as the spoken word, the Euclidean distance between them was calculated; the words were considered the same if the calculated distance was the least among all the stored words. Implementing this formula requires calculating the squares of the differences, but because of Tor's fixed point arithmetic the filter coefficients had been multiplied by 256, and squaring would have produced numbers large enough to overflow the buffers [6]. Hence a "pseudo Euclidean distance formula" was used, dropping the squares and the square root and summing the absolute differences instead, reducing the distance to:

D = |p1-q1| + |p2-q2| + ... + |pn-qn|    (4.5)
The basic algorithm for the code was to check the ADC input at a sampling rate of 4 kHz. If the ADC value was greater than the threshold value, it was interpreted as the beginning of a half-second-long word. The samples of the spoken word were passed through the eight band pass filters and converted into a fingerprint. The Euclidean distance calculation then found the closest match among the fingerprints stored in the EEPROM, and the corresponding red LED was turned on.
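A minimal sketch of the pseudo Euclidean comparison of equation (4.5); the array sizes mirror the appendix code (three words of 128 values each), but the function itself is illustrative:

    unsigned char best_match(const int saved[3][128], const int cur[128])
    {
        unsigned char w, best = 0;
        long d, diff, dmin = 0x7FFFFFFF;
        int  k;
        for (w = 0; w < 3; w++) {                /* one pass per stored word */
            d = 0;
            for (k = 0; k < 128; k++) {          /* sum of absolute differences */
                diff = (long)saved[w][k] - (long)cur[k];
                d += (diff >= 0) ? diff : -diff;
            }
            if (d < dmin) { dmin = d; best = w; }
        }
        return best;                             /* index of the closest word */
    }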
4.3.4 Initial Threshold Calculation
As part of the initialization, the ADC value was read 256 times using timer 0, and the average was calculated without any multiplication or division. Three such values were obtained, with a gap of almost 16.4 ms between the sample accumulations. After getting the values, the threshold was taken as four times the median value. The threshold value was calculated to detect the spoken word, and it also prevented the voice recognition system from being too sensitive. The logic was implemented without much success.
4.3.5 Software Flow
Figure 4.7 shows the software flow for the voice recognition system. The first step is the resetting of the board, after which the system computes the filter coefficients. The peripheral initialization function initializes all the ports, the timers and the USART of the ATMEL Mega32. If pin 7 of port C is high (it is connected to the switch), all the old voice fingerprints are removed from the EEPROM. The system then checks the flag in the EEPROM (address 0x00): if the flag is 0xAA, words are stored in the EEPROM, and the corresponding word's green LED blinks. A maximum of three words can be stored in the EEPROM. The system runs an infinite loop (while(1), with a timer running in the background) to calculate the threshold noise. If the data from the ADC is greater than the threshold value, sampling of the input voice starts. During the sampling process, filter values are calculated using assembly functions. Once the 2000 samples (half a second) have been collected, the filter outputs are stored in an array.
[Flowchart summary: after start, the eight Chebyshev type 2 band pass filters (200 Hz to 1800 Hz) are designed; Peripheral_init() configures the ports, timers (TCCR0 = 0b00001011, OCR0 = 62, TCCR1A = 0b11110010, TCCR1B = 0b00010010, TIMSK = 0b00000010), ADC (ADMUX = 0b00100000, ADCSR = 0b11000111) and USART (UBRRL = 103, UCSRB = 0x18); Noise_detect() calculates the threshold; if PORTC.7 is asserted the EEPROM is reset (sflag = 0xff); if sflag == 0xAA the system enters voice recognition mode, where In_word() and Compare() match the current sample against the stored samples and one of the three red LEDs blinks for the word detected; otherwise it enters learning mode, where the three green LEDs blink as the fingerprints of three words are stored in the EEPROM.]

Figure 4.7: Software Flow for Voice Recognition System
If there are words available in the EEPROM, the system calls the Compare() function to compare the EEPROM samples with the current samples. If there are no words in the EEPROM, the samples are written to the EEPROM; this is called the training mode. All the green LEDs glow once all three words have been entered successfully. The Compare() function calculates the shortest distance between the samples of the words stored in the EEPROM and the current sample, and one of the three red LEDs blinks according to the shortest distance calculated with the Euclidean distance formula.
Chapter 5
SIMULATIONS AND DESIGN IMPLEMENTATION ANALYSIS
5.1 Speech Frequency Analysis
The human speech frequency is less than 4000 Hz; hence, as per the Nyquist theorem, the minimum sampling rate should be 8000 Hz. The analysis of the human voice was done in MATLAB before the design implementation. The Data Acquisition Toolbox in MATLAB was used to record six words, each for exactly two seconds, at a sampling rate of 8000 Hz. The data acquisition tool takes the analog input from the microphone connected to the computer and samples the speech at the sampling frequency requested by the user; the tool supports a minimum sampling frequency of 8000 Hz, so each two-second recording contained 8000 samples per second. The following MATLAB commands were used to record the two-second words.
duration = 2; % two second recording period
ai = analoginput('winsound');
addchannel(ai, 1);
sampleRate = get(ai, 'SampleRate')
get(ai, 'SamplesPerTrigger')
requiredSamples = floor(sampleRate * duration);
set(ai, 'SamplesPerTrigger', requiredSamples);
waitTime = duration * 1.1 + 0.5 % Buffer for the response time of the hardware
start(ai)
tic
wait(ai, waitTime);
toc
[data, time] = getdata(ai); % retrieving data from the getdata
Figure 5.1: Signal Variation of "Hello"
Figures 5.1 and 5.2 show the signal variation of the words "Hello" and "One", respectively, over a period of two seconds. The y axis shows the signal strength in volts and the x axis shows the two-second time duration.
Figure 5.2: Signal Variation of "One"
5.2 Filter Design and Fingerprint Generation Analysis
The analysis of the filter design and fingerprint generation is explained in the following two subsections.
5.2.1 MATLAB Filter Design and Fingerprint Generation
The filter design was tested in MATLAB before it was implemented in hardware. Eight band pass filters ranging from 200 Hz to 1800 Hz were used. The coefficients of each filter were obtained using the following MATLAB command, where Freq1 and Freq2 are the band edges normalized to the Nyquist frequency:

[B, A] = cheby2(4, 40, [Freq1, Freq2]);

After the calculation of the coefficients, the 8000 samples per second of each word were downsampled to 4000 samples per second, since the sampling frequency of the target system is 4000 Hz. The 4000 samples of each word were divided into 16 groups, each containing 250 samples. This was done to pass every sample of the word through the filters and generate 16 data points, or fingerprint values, that define a particular word. The 16 groups were passed through the eight filters; the output of a filter for each sample was squared and added to the previous value within one group, that is, over 250 samples, generating one fingerprint value for that group. In this way each filter generated 16 fingerprint values for each word. The fingerprint accumulation process is shown in Figure 5.3.
Figure 5.3: Fingerprint Accumulation after Every 250 Samples for a Word
The following MATLAB commands were used to generate the fingerprints (the original script unrolled the sixteen groups u1 to u16; the equivalent loop form is shown here):

l1 = length(x);
x1 = resample(x, 4000, l1);           % resampling to 4000 samples
y(:,1) = x1;
for i = 1:4                           % using only four words
    for g = 1:16                      % groups of 250 (250 * 16 = 4000 samples)
        count = (g-1)*250 + (1:250);
        u = y(count, i);              % one group of samples
        output = filter(B1, A1, u);   % passing the group through the filter
        ot = output .* output;        % squaring the outputs of the filter
        result(g) = sum(ot);          % adding all the squared results
    end
    % concatenating the 16 fingerprints for this filter
    fingerprint_filter1(i,:) = result;
end
The total number of fingerprint values generated by the 8 band pass filters is 8 x 16 = 128. These fingerprints can be plotted on a graph and the difference between words can be judged. Figure 5.4 shows the fingerprints of the words "Back" and "History".
Figure 5.4: Fingerprints of Words "Back" and "History"
5.2.2 Actual Filter Design and Fingerprint Generation
The actual filter design was based on Prof. Land's optimized implementation of the second order IIR filter on the AT Mega32 microcontroller. In this case the 4th order Chebyshev filter is designed by cascading two second order Chebyshev filters. The coefficients of the second order filter were obtained with the following MATLAB command:

[B, A] = cheby2(2, 40, [Freq1, Freq2]);

The second-order-section coefficients were then obtained with:

[sos1, g1] = tf2sos(B1, A1, 'up', 'inf');

The coefficients obtained from the above commands are floating point values, whereas the actual design implementation is based on fixed point arithmetic. The floating point coefficients were converted to fixed point by multiplying each number by 256 and then rounding it to the nearest integer, rather than using a float2fix macro; the process is based on Tor's algorithm. Tables 5.1 and 5.2 show the first and second 2nd order coefficients, respectively, for the 200-400 Hz band pass filter; a one-line conversion macro follows the tables.
SOS Floating Point       SOS Fixed Point Coefficient      Rounded to Nearest
Coefficient              (Multiplied by 256)              Integer
1.7613                   1.7613 * 256 = 450.8928          451
-0.9700                  -0.9700 * 256 = -248.32          -248
0.0816                   0.0816 * 256 = 20.8896           21
-0.1233                  -0.1233 * 256 = -31.5648         -32
0.0816                   0.0816 * 256 = 20.8896           21
Table 5.1: Band Pass Filter 200-400 Hz First 2nd Order Coefficients
SOS Floating Point       SOS Fixed Point Coefficient      Rounded to Nearest
Coefficient              (Multiplied by 256)              Integer
1.7903                   1.7903 * 256 = 458.3168          458
-0.968                   -0.968 * 256 = -247.808          -248
8.6923                   8.6923 * 256 = 2225.2288         2225
-16.7363                 -16.7363 * 256 = -4284.4928      -4285
8.6923                   8.6923 * 256 = 2225.2288         2225
Table 5.2: Band Pass Filter 200-400 Hz Second 2nd Order Coefficients
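The conversion rule used in these tables can be written as a one-line C macro; FLOAT2FIX is a hypothetical name, since the project performed the conversion by hand:

    #include <math.h>

    #define FLOAT2FIX(c) ((int)lround((c) * 256.0))   /* scale by 256, round to nearest */

    /* e.g. FLOAT2FIX(1.7613) == 451 and FLOAT2FIX(-0.1233) == -32, matching Table 5.1 */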
Each data point was generated by accumulating the outputs of each filter over 125 samples instead of 250. The reason is the 2 KB random access memory limitation of the AT Mega32 microcontroller: 2000 samples (half a second) require almost 2 KB of storage when passed through the eight filters. The total number of data points created by all the filters remained 128, so 2000 samples per half second were taken instead of the 4000 samples per second used in the MATLAB simulation.
5.3 Fingerprint Comparison Analysis
All the samples of a word pass through the eight filters, and the output of each filter is the accumulation of the squares of 250 consecutive filter outputs. All words have different frequency spectra. The same word should ideally have the same frequency spectrum, but the simulations showed that even the same word spoken at different times does not have the same fingerprint. Figure 5.5 shows the fingerprints of the word "Hello" spoken at different intervals of time; the fingerprint of the same word varies from one utterance to the next.
Figure 5.5: Fingerprints of the Word "Hello" at Different Intervals of Time
The relevant information of a word is stored in its fingerprint. Once the fingerprints of the three words "Hello", "Uno" and "One" were stored in the EEPROM of the AT Mega32, spoken words were compared against the stored words. For the comparison, a function was called that performed the pseudo Euclidean distance calculation between the fingerprint values stored in the EEPROM and the fingerprint of the spoken word; the function went through all three words in the EEPROM and picked the one with the smallest calculated distance. Another set of words tried for voice recognition was "back", "history" and "run". The following commands were used in the MATLAB simulations to calculate the distances between the words:

% Calculating the (pseudo) Euclidean distances
d1 = sum(abs(fingerprint(:,1) - fingerprint(:,2))); % distance between back.wav and history.wav
d2 = sum(abs(fingerprint(:,2) - fingerprint(:,3))); % distance between history.wav and hello22.wav
d3 = sum(abs(fingerprint(:,1) - fingerprint(:,3))); % distance between back.wav and hello22.wav
d4 = sum(abs(fingerprint(:,3) - fingerprint(:,4))); % distance between hello22.wav and hello33.wav
The simulation results showed that the distance between hello22 and hello33 was the minimum, and the same was verified on the hardware implementation of the Euclidean distance.
Chapter 6
CONCLUSION
The implementation of the hardware-based real-time voice recognition system was successful. The project was an 80% success, as it can recognize particular combinations of three words. The user can train the system in the beginning with three words of his/her choice, and the fingerprints (frequency samples) are stored in the EEPROM of the AT Mega32 controller. Each fingerprint takes around 300 bytes of space; due to the microcontroller's limited EEPROM of 1 KB, it was only possible to store the fingerprints of three words.
The algorithm used for the voice recognition is effective and is able to recognize words like "Back", "History" and "Run"; the other combination of three words tested was "Hello", "One" and "Uno". The limitation of the system is that it can recognize only a few combinations of words. The Euclidean distance was calculated between each word to be recognized and the stored words, and the corresponding red LED blinked for the minimum distance. The calculated distances could also be seen on HyperTerminal, and the results were verified in practice; Figure 6.1 shows a screenshot from HyperTerminal. The voice recognition system can be extended for use in various applications such as voice-based security systems and voice-controlled car navigation systems.
Figure 6.1: HyperTerminal Screen Shot
APPENDIX
Source Code
////******************************************************************////
//// Function Name : void main(void)
//// Description   : This is the main function. It initializes all the
////                 peripherals, then starts reading samples from the ADC
////                 and compares them with the EEPROM data. If the new
////                 word is near a word stored in the EEPROM (using the
////                 Euclidean formula), it shows the nearest match.
////******************************************************************////
void main(void)
{
    Peripheral_init();                  // Initialize all peripherals
    pin = PINC;
    PORTC = pin;
    if (PORTC.7 == 0)                   // If the port pin is 0, erase the EEPROM
    {
        sflag = 0xff;                   // Flag in the EEPROM is erased
        ENWORD = 0;                     // Word count in the EEPROM is erased
        printf(" \n\t ******** Clearing of EEPROM Memory Done******** \n\r ");
    }
    if (sflag == 0xAA)                  // Three valid words are available in the EEPROM
    {
        printf(" Voice Recognition Mode ");
        PORTB.0 = 0; PORTB.1 = 0; PORTB.2 = 0;
    }
    else                                // Fewer than three words in the EEPROM:
    {                                   // show the word count on the LEDs
        printf(" Learning Mode ");
        if (ENWORD == 0)
        {
            PORTB.0 = 1; PORTB.1 = 1; PORTB.2 = 1;
        }
        else if (ENWORD == 1)
        {
            PORTB.0 = 0; PORTB.1 = 1; PORTB.2 = 1;
        }
        else if (ENWORD == 2)
        {
            PORTB.0 = 0; PORTB.1 = 0; PORTB.2 = 1;
        }
    }
    while (1)                           // Infinite loop
    {
        if (latest_sample == 1)         // Set to 1 from the timer ISR: read a sample
        {
            In_word();                  // Sets flag2 = 1 when the ADC input is greater
                                        // than the threshold value (word detected)
            if (flag2 == 1)             // Beginning of a word detected from the ADC
            {
                input = (((int)adc_in) - 103);  // Remove the default DC offset from the
                                                // ADC reading to prevent overflow of the
                                                // input data in the fixed point arithmetic
                f1 = iir2_1(input);     // Pass the sample through the eight
                f2 = iir2_2(input);     // second order IIR band pass filter stages
                f3 = iir2_3(input);
                f4 = iir2_4(input);
                f5 = iir2_5(input);
                f6 = iir2_6(input);
                f7 = iir2_7(input);
                f8 = iir2_8(input);
                s_count++;              // Increment the sample counter
                if (s_count == 125)     // 125 samples have passed through this iteration
                {
                    g_count++;          // 125 samples taken 16 times: 16 x 125 = 2000 samples
                    s_count = 0;        // Reset the fingerprint sample counter
                    j = 0;
                    Group_sample[i][j++] = f1; Group_sample[i][j++] = f2;
                    Group_sample[i][j++] = f3; Group_sample[i][j++] = f4;
                    Group_sample[i][j++] = f5; Group_sample[i][j++] = f6;
                    Group_sample[i][j++] = f7; Group_sample[i][j++] = f8;
                    f1 = 0; f2 = 0; f3 = 0; f4 = 0;        // Clear the accumulators
                    f5 = 0; f6 = 0; f7 = 0; f8 = 0;        // for the next inputs
                    i++;
                }
                if (g_count == 16)
                {
                    g_count = 0;
                    s_count = 0;
                    flag2 = 0;
                    i = 0;
                    j = 0;
                    for (m = 0; m < 8; m++)
                    {
                        for (l = 0; l < 16; l++)
                        {
                            cur_fing[k] = Group_sample[l][m];  // Store the data into the
                            k++;                               // one dimensional array
                        }
                    }
                    k = 0;
                    if (sflag == 0xAA)  // Words are available in the EEPROM: compare
                    {
                        Compare();      // Compare the current samples with the
                                        // samples stored in the EEPROM
                    }
                    else                // Write the samples to the EEPROM and
                    {                   // increment the word count
                        if (ENWORD > 3) // If the word count is garbage or FF, zero it
                            ENWORD = 0;
                        for (o = 0; o < 128; o++)
                        {
                            saved_fing[ENWORD][o] = cur_fing[o];  // Write data to EEPROM
                        }
                        if ((ENWORD + 1) == 3)  // All three words written: set the flag
                        {                       // in the EEPROM at location 0x00 to 0xAA
                            sflag = 0xAA;
                            ENWORD = ENWORD + 1;    // Increment the word count in EEPROM
                        }
                        else
                            ENWORD = ENWORD + 1;    // Increment the word count in EEPROM
                        printf(" Data Stored %d ", ENWORD); printf("\n\r");
                        // Drive the corresponding green LEDs for the word count
                        if (ENWORD == 0)
                        {
                            PORTB.0 = 1; PORTB.1 = 1; PORTB.2 = 1;
                        }
                        else if (ENWORD == 1)
                        {
                            PORTB.0 = 1; PORTB.1 = 1; PORTB.2 = 0;  // 1st green LED on
                        }
                        else if (ENWORD == 2)
                        {
                            PORTB.0 = 1; PORTB.1 = 0; PORTB.2 = 0;  // 1st and 2nd green LEDs on
                        }
                        else if (ENWORD == 3)
                        {
                            PORTB.0 = 0; PORTB.1 = 0; PORTB.2 = 0;  // All three green LEDs on
                        }
                    }
                }   // if (g_count == 16)
            }   // if (flag2 == 1)
            latest_sample = 0;          // Reset to 0 for the next ADC conversion
        }   // if (latest_sample == 1)
    }   // while (1)
}   // main
REFERENCES

[1] Jim Baumann, "Voice Recognition", Human Interface Technology Laboratory, Fall 1993.
[2] Richard L. Klevans and Robert D. Rodman, "Voice Recognition", Artech House on Demand, 1997.
[3] Ton Kalker and Jaap Haitsma, "A Review of Algorithms for Audio Fingerprinting", Fall 2000.
[4] Yoshikazu Miyanaga, "Robust Speech Recognition and its ROBOT Implementation", Hokkaido University, 2009.
[5] "Atmel Mega32 PU Datasheet", Atmel. Available: http://www.atmel.com
[6] Zhu Liu and Yao Wang, "Audio Feature Extraction and Analysis for Scene Segmentation and Classification", Polytechnic University, 1999.
[7] Jaap Haitsma, Ton Kalker and Job Oostveen (Philips Research), "An Efficient Database Search Strategy for Audio Fingerprinting", 2002 IEEE Workshop on Multimedia Signal Processing.