Proposal Document

Yousef Rabah
Page 1
2/18/2016
Speech Recognition Application:
Voice Enabled Phone Directory
Introduction:
Speech recognition is seen as one of the promising market technologies of the
near future. As an example, companies such as Advanced Recognition
Technologies, Inc. (ART) and Microsoft, as well as open-source communities,
have been integrating and implementing speech recognition systems in their
software. These voice-command applications are expected to cover many aspects
of our daily lives, from telephones to the Internet. Voice-command applications
will make life easier because people will get easy and fast access to
information.
It is important to understand the process of speech recognition in order to be able
to implement and integrate it into different applications. First of all, there are two types of
speech recognition. The first is a ‘speaker dependent system’ that is designed for a single
speaker; it is easy to develop but not flexible to use. The second is a ‘speaker
independent system’ designed for any speaker. It is harder to develop, less accurate, and
more expensive than the ‘speaker dependent’ system, yet it is more flexible.
The vocabulary sizes of the Automatic Speech Recognition (ASR) system range
from a small vocabulary that would consist of two words to a very large vocabulary that
consists of tens of thousands of words. The size of the vocabulary affects the complexity,
processing requirements, and the accuracy of the ASR system. There are two types of
vocabularies: the first is an isolated system that uses a single word - either a full word or
a letter - at a time. It is the simplest type because it is easy to find the starting and ending
points of a word. In the second type, the continuous system, this is much
harder because we are actually dealing with whole sentences.
There are a number of factors that can affect an ASR system, such as
pronunciation and frequency. The speaker's current mood, age, sex, dialect, and inflections,
as well as background noise, can affect the accuracy and performance of such a system. It is
thus necessary for the system to bypass these obstacles in order to be more accurate.
As an example, the system can use filters to reduce problems like
background noise, coughs, and heavy breathing. Therefore, in most systems filtering is
the first stage of speech analysis: speech is filtered before it reaches the
recognizer. Speech processing requires analog-to-digital conversion,
in which the voice's pressure waves are converted into numerical values so that they can be
digitally processed.
The Hidden Markov Model (HMM) is a Markov chain where the output symbols, or the
probabilistic functions that describe them, are attached either to the states or to the
transitions between states. The algorithm consists of a set of nodes that are chosen to
represent a particular vocabulary. These nodes are ordered and connected from left to
right, and recursive loops are allowed. Recognition is based on a transition matrix giving
the probability of changing from one node to another. Each node represents the probability
of a particular set of codes. [Figure illustrating the functionality of ASR, from (11), not reproduced here.]
The HMM ‘Q’ is often referred to as a parametric model because the state of the
system at each time t is completely described by a finite set of parameters. The algorithm
that estimates the HMM parameters (the training algorithm) starts from a good first guess.
The initialization model computes this first guess of the HMM parameters using the
preprocessed speech data (features) together with their associated phoneme labels. The
HMM parameters are stored as files and then retrieved by the training procedure.
[Figure illustrating initialization in ASR, from (11), not reproduced here.]
Before estimating the HMM parameters, the basic structure of the HMM must be defined. To
be specific, the graph structure, which is the number of states and their connections, and
the number of mixtures per state, M, must be specified. A good way to understand HMMs is
through an example. If we build a model Q that recognizes only the word “yes”, the word is
composed of the two phonemes ‘\ye’ and ‘\s’, which corresponds to the six states of the two
phoneme models. To be more accurate, “yes” is composed of ‘\y’ ‘\eh’ ‘\s’.
The ASR system does not know which acoustic states the speaker had in mind, so it tries
to find the spoken words by reconstructing the most likely sequence of states and words W
that could have generated the observed speech features X.
Model training is performed by estimating the HMM parameters; estimation
accuracy is roughly proportional to the amount of training data. The HMM is well suited
for a speaker-independent system because the training works with probabilities, or
generalizations, over the speech data, which makes it a good model for multiple
speakers.
It is important to keep in mind the E-set, which includes b, c, d, e, g, p, t, v, and z,
because when these letters are pronounced they sound very much alike.
Thus it is important to keep them in mind when dealing with issues like pronunciation.
Statement of the Problem:
The focus of my project is an automatic, speech-interacting phone
directory assistance system that works without human interaction. It is hard to find a
speech-command system that calls numbers for you because of all the complications that I
have mentioned above.
Proposed Solution:
My solution consists of three parts. I will go through each of them, explain my
approach, and state what I would like to get out of it. I will then demonstrate how
they all play a part in the final configuration. [Diagram showing an overview of the
models not reproduced here; the next paragraphs explain it.]
Sphinx:
The first part needed is an ASR system that I can work with in order to build my
speech-enabled phone directory. I need a speaker-independent system based on HMMs
that has a large vocabulary. After researching the matter, I have decided to use
Sphinx, from Carnegie Mellon University, as my ASR system.
In Sphinx, the basic sounds of the language are classified into phonemes, or phones.
The phones are distinguished according to their position within the word (beginning, end,
internal, or single) and are further refined into context-dependent triphones.
Acoustic models are built from these triphones. Triphones are modeled
by HMMs and usually contain three to five states. The HMM states are clustered into a
much smaller number of groups called senones.
The input audio consists of 16-bit samples, at 8 to 16 kHz, stored in .raw files.
Training requires good data, consisting of spoken text, or utterances. Each
utterance:
- Is converted into a linear sequence of triphone HMMs using the pronunciation
lexicon.
- Has its best state sequence, or state alignment, found through the HMMs.
For each senone, all of its frames in the training data are gathered and mapped in order
to build suitable statistical models. The language model consists of:
- Unigrams: the entire set of words and their individual probabilities of
occurrence in the language.
- Bigrams: the conditional probability that word 2 immediately follows word 1
in the language.
- Information for a subset of the possible word pairs.
Sphinx also contains the lexicon structure, which is the pronunciation dictionary: a file
that specifies word pronunciations. Pronunciations are specified as linear sequences of
phones. Also, it is essential to know that there can be multiple pronunciations for the same
word or letter. The dictionary also includes a silence symbol <sil> to represent the user’s
silence. As an example, ‘ZERO’ is pronounced ‘Z IH R OW’.
Database (ADB):
The second item needed for my project is a database. I decided to use
PostgreSQL for this part. The database, named ADB, will contain a “People” entity
with these attributes:
1. pid: an attribute that contains the unique identification for each person; it is of
type integer.
2. first_name: an attribute that contains the first name of a person; it is of type
varchar(20).
3. last_name: an attribute that contains the last name of a person; it is also of type
varchar(20).
4. phone_number: an attribute that contains the phone number; it is of type
varchar(12) and UNIQUE (which means the system will not accept the same number
more than once).
5. city: an attribute that contains the city name; its type is varchar(15).
The primary key is (pid, first_name, last_name).
Here is an example of what the Database contains:
 pid | first_name | last_name | phone_num    | city
-----+------------+-----------+--------------+--------------
   1 | Sam        | Smith     | 765-973-2743 | Ramallah
   2 | George     | Adams     | 765-973-2741 | Richmond
   3 | Sam        | Knight    | 765-973-2222 | Houston
   4 | Kathrin    | Smith     | 765-973-3343 | Jerusalem
   5 | Samer      | Abdo      | 765-973-2190 | Jacksonville
The database’s function is matching. The database holds each person’s
information and will provide the data needed by the phone directory. In other
words, it acts as an address book, but at the same time it can select the information
needed by the application. For example, you can either select all names in the
directory, or you can select a specific person by first name or last name. Selecting and
inserting functions are important aspects of the database (ADB).
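The schema and the selecting functions described above can be sketched in SQL; the table and column names follow the attributes listed earlier, while the exact statements are illustrative rather than the final application queries:

```sql
-- Illustrative sketch of the ADB schema described above.
CREATE TABLE People (
    pid          integer,
    first_name   varchar(20),
    last_name    varchar(20),
    phone_number varchar(12) UNIQUE,  -- same number accepted only once
    city         varchar(15),
    PRIMARY KEY (pid, first_name, last_name)
);

-- Select all names in the directory:
SELECT first_name, last_name FROM People;

-- Select a specific person by first name:
SELECT first_name, last_name, phone_number
FROM People
WHERE first_name = 'Sam';
```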
Application:
The application is the third item in the deliverables, and it is going to be one of the most
important ones. It will serve as the main connector between the ASR system (Sphinx) and
the database (ADB). Basically, the application will provide easy communication between
Sphinx and the database to send and receive information. As I said briefly before, through
Sphinx I will be able to decode speech in the form of the spoken letters of a first or last
name. Once the speech is decoded, the application will communicate with the database (ADB).
The application’s functionality will include:
Interface:
Connect to DB:
- Add person.
- Delete person.
- Edit person.
- View (through different select statements, depending on what the user
wants). I will talk about viewing in the next section.
* note: “person” here includes first name, last name, and number.
Connection with Sphinx to:
- Decode the letters said by the user.
- Use the generated log file to view and grep results.
- Strip the silence symbols, as well as the spaces between letters, in order to put
the decoded letters together, forming the person’s first name or last name as a
“word”.
Communication with User:
- In order to ensure that the decoder actually decoded the letters said by the user,
he/she will be asked “did you say ‘word’?”. The word does not have to be a
complete word; it could be a couple of letters from a word. As long as the user
says the letters needed, the application will run commands to connect to the
database and get the matching words back. The user will be asked whether the
‘letters of a word’ are for the first name or the last name. If, for example, the
user gets a lot of names, the user has the option to say more letters. These
letters will be combined with the previous letters, and then the same operation
happens: the application will connect to ADB and get back results.
The application(s), as I said, will communicate with Sphinx and ADB. The programming
languages that I will use include Perl, C, PHP, and shell scripts.
Final Paper:
This part will consist of joining some of the proposal paper with more information
on the applications that I want to build. It will also include the source files and
reflect upon the results that I have obtained. It could also include bugs and
enhancements that could be implemented at a later stage.
Timeline:
Clarification of tasks:
- Reading: My aim was to go into the new field of automatic speech recognition. At first I
read general background information on speech recognition, and then I moved deeper into
the subject. I read about different approaches, and I also reviewed some information
about databases (PostgreSQL).
- Bibliography: The bibliography consists of the different journals, articles, and books
that I have read or was reading at the time.
- Survey Paper & Presentation: This part consisted of background knowledge and
information on the subject. Working on this paper and presentation gave me a good grasp
of the content and a solid understanding of the area.
- Find ASR system: During this time I was looking into different, mainly open-source,
software packages out of which I would later build my application.
- Test & Configure ASR: This part consisted of getting the software working. I chose
the Sphinx ASR system. There were a lot of dependencies that I had to configure in order
for the software to work properly. I tested the system and tried to work with its language
model. I also tested the software through its raw files and then figured out how to
convert .wav files into .raw files.
- Proposal Presentation: This involved setting up and getting different ideas into place.
I took the important parts of the survey that I needed to include in the proposal
presentation. It covered what I wanted to work with, for example the language and its model.
- Build Database: This part consisted of building the database. I chose to work with
PostgreSQL and had to fulfill a couple of dependencies here as well. I created a database,
called ADB, and set up a table and tuples. I inserted over 9 entries for different
people, each including a first name, last name, phone number, and city. I also wrote the
different select statements that I will use in the application that I am building.
- Proposal Paper: This part consisted of writing the proposal paper. It included what I
presented in the proposal presentation and, at the same time, added more details about
the application-building process.
- Build Applications: Consists of implementing the application part, which will join the
ASR system (Sphinx) and the PostgreSQL database (ADB). I will use C, Perl, PHP, and
shell scripting for writing the applications.
- Test Applications: This part will consist of testing the applications described in
the paper. It will cover the different bugs and/or enhancements that could later be
implemented to improve the program.
- Final Paper & Revisions: This part will consist of joining some of the proposal paper
with more detail about the applications I built, as well as the source files that are
included. It will also reflect upon the results reached, and upon bugs and enhancements
that could be implemented at a later stage.
- Colloquium Preparation: Consists of preparing the presentation for the colloquium. It
will include some of the proposal presentation but will go further, covering the
applications, their results, and the reported bugs.
- Colloquium: Making final preparations and presenting the colloquium.
Bibliography
White, George M. "Natural Language Understanding and Speech Recognition."
Communications of the ACM 33 (1990): 74-82.
Osada, Hiroyasu. "Evaluation Method for a Voice Recognition System Modeled with
Discrete Markov Chain." IEEE, 1997: 1-3.
Bradford, James H. "The Human Factors of Speech-Based Interfaces: A Research
Agenda." SIGCHI Bulletin 27 (1995): 61-67.
Shneiderman, Ben. "The Limits of Speech Recognition." Communications of the
ACM 43 (2000): 63-65.
Danis, Catalina, and John Karat. "Technology-Driven Design of Speech Recognition
Systems." ACM, 1995: 17-24.
Suhm, Bernhard, et al. "Multimodal Error Correction for Speech User Interfaces."
ACM Transactions on Computer-Human Interaction 8 (2001): 60-98.
Brown, M.G., et al. "Open-Vocabulary Speech Indexing for Voice and Video Mail
Retrieval." ACM Multimedia 96, 1996: 307-316.
Christian, Kevin, et al. "A Comparison of Voice Controlled and Mouse Controlled
Web Browsing." ACM, 2000: 72-79.
Falavigna, D., et al. "Analysis of Different Acoustic Front-Ends for Automatic Voice
over IP Recognition." Italy, 2001.
Simons, Sheryl P. "Voice Recognition Market Trends." Faulkner Information Services,
2002.
(11) Becchetti, Claudio, and Lucio Prina Ricotti. Speech Recognition: Theory and
C++ Implementation. New York, 1999.
Abbott, Kenneth R. Voice Enabling Web Applications: VoiceXML and Beyond. New
York, 2002.
Miller, Mark. VoiceXML: 10 Projects to Voice Enable Your Web Site. New York,
2002.
Syrdal, A., et al. Applied Speech Technology. Ann Arbor: CRC, 1995.
Larson, James A. VoiceXML: Introduction to Developing Speech Applications. New
Jersey, 2003.