VIRS API - Dec1411 Senior Design Project

VIRS (VIRS IS RECOGNIZING SPEECH) DESIGN DOCUMENT
A SPEECH RECOGNITION PROJECT FOR VIRTUTRACE
James Boddie, Karl Akert, Geoffrey Miller, Colleen Riker, Michael Potter
CPRE/SE 491, LAST MODIFIED MARCH 13, 2014
Contents
Project Statement
Deliverables
Specifications
    Terminology
    Software Coding Standards
    Frameworks, IDEs, and Other Software Used
    Operating Environment
System Level Design
    System Requirements
    Functional Decomposition
    System Analysis & Concept Designs
Detailed Project Description
    I/O Specification
    Interface Specification
VIRS API
    Init()
    LoadCommands(Dictionary<Command,Function>)
    RemoveCommands(list commands)
    InterpretCommand(RawAudioType audioFile)
    CommandHistory(TimeSpan timeSpan)
Hardware / Software Specification
Implementation Issues & Challenges
Testing, Procedures, & Specifications
Conclusion
Appendix A (Outside APIs and Documents)
Appendix B (References)
Project Statement
Currently in the C-6 facility there is a program called VirtuTrace in which three-dimensional situational
programs are used to enhance training for members of the local fire departments. Within this
program users must gather information, and the current system uses a pop-up decision matrix
in which they point and click to make decisions. This system is efficient, but it is not very
immersive for the user. Our goal is to implement a discrete voice command system (VIRS)
that is robust enough to interpret natural language and quickly return the requested information
to the user in an audio format.
Deliverables
Logger: There needs to be a system that records the history of commands executed by the
operator and the responses given by the system.
Speech Recognition Interface: The speech recognition will be designed separately from the
VirtuTrace project; therefore, it will need a clean API with callable methods. Developing it
separately allows the software to adapt to changes and be usable by other systems.
Speech-to-Text Translator: The speech recognition software will need to be able to convert
spoken word into a textual format easily interpreted by the software.
Command Interpreter: This deliverable converts the textual information created by the
translator into usable commands.
Decision Processor: This portion of the speech recognition software needs to be able to
interpret a series of commands, either predetermined or new, and respond to the user in a
meaningful way: an audio output response that gives the user choices, describes their
situation, or updates the program based on a user's vocal command.
Specifications
Terminology
API: (Application programming interface) An interface of software library methods and specifics
on how methods interact with one another.
Phone: An unanalyzed sound in a language; similar classes of sounds. (Appendix B-1)
Diphone: An adjacent pair of phones; part of phones between two consecutive phones.
(Appendix B-1)
Senone: Phones considered by context (i.e. these build sub words like syllables); also called
reduction-stable entity. When speech becomes fast, phones change but syllables stay the same.
(Appendix B-1)
Filler: Non-linguistic sounds like breath, coughs, “um,” “uh,” and background sounds. (Appendix
B-1)
Utterance: Separate chunks of audio between pauses. These are formed by using fillers as
delimiters. (Appendix B-1)
Feature: Numbers calculated from speech by dividing speech on frames. (Appendix B-1)
Feature Vector: An n dimensional vector of features containing the numerical representation of
the captured speech. (Appendix B-1)
Model: Describes some mathematical object that gathers common attributes of a spoken word.
(Appendix B-1)
Acoustic Model: Acoustic properties for each senone. Most probable feature vectors for each
phone. The acoustic model is built from senones with context. (Appendix B-1)
Phonetic Dictionary: Contains the mapping from words to phones. Usually only two to three
pronunciation variants are noted, which is not very effective but practical most of the time.
This dictionary can be replaced by machine learning algorithms. (Appendix B-1)
Language Model: Used to restrict word searches. Defines which word could follow previously
recognized words. Ideally this should be very good at guessing the next word based on a
sequence of previous words. (Appendix B-1)
Software Coding Standards
Throughout the speech recognition software there are many standards that must be adhered to.
VirtuTrace Coding Standards: All standards for working with the VirtuTrace project can be
found at Appendix A-2.
CMU Sphinx: All standards and libraries for working with CMU Sphinx can be found at
Appendix A-1.
Festvox: All standards and libraries for working with Festvox can be found at Appendix A-3.
Group Standards:
1. At the method level comment about what the method does and specify parameters
2. Each method created must have a unit test to check functionality
3. General comments must explain why an algorithm is written a certain way and what it
does
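As a small illustration of standards 1 and 2, a documented method paired with its unit test might look like the following. NormalizeCommand is a hypothetical example, not part of the VIRS code base.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// NormalizeCommand: lower-cases a spoken command and trims surrounding
// spaces so that "  Move Right " and "move right" compare equal.
// Parameter: raw - the command text produced by the speech-to-text stage.
// Returns: the normalized command string ("" if the input is all spaces).
std::string NormalizeCommand(std::string raw) {
    std::transform(raw.begin(), raw.end(), raw.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    const auto first = raw.find_first_not_of(' ');
    if (first == std::string::npos) return "";
    const auto last = raw.find_last_not_of(' ');
    return raw.substr(first, last - first + 1);
}

// Unit test for NormalizeCommand, as required by group standard #2.
void TestNormalizeCommand() {
    assert(NormalizeCommand("  Move Right ") == "move right");
    assert(NormalizeCommand("   ") == "");
}
```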
Frameworks, IDEs, and Other Software Used
Ubuntu: The operating system VirtuTrace and the text editors run in
C++: The programming language for writing speech recognition software
BitBucket (Git): The hosting site for the Git repository
Vim: Text editor running in Linux
Gedit: Text editor running in Linux
Sublime: Text editor running in Linux
VirtuTrace: Software to run virtual environments to be used to study decision making in
stressful situations
Sphinx: Carnegie Mellon University’s speech recognition code base and suite of tools
Festvox: Voice synthesizing software
Figure 1
Operating Environment
As described in the section above, the VIRS software will only compile and run under Linux operating
systems (specifically Ubuntu). The API that is supplied will be in C++ only. It will be designed to be
independent of the VirtuTrace software, but considering it is initially for VirtuTrace, many of the
methods may only work under that environment.
System Level Design
System Requirements
VirtuTrace can only run on a Linux operating system; Ubuntu is the easiest distribution to
use. A computer with a text editor is needed for writing code.
Functional Decomposition
The main purpose of the VIRS speech recognition software is to take audio input in from a
microphone, interpret it, and respond to the user. The software has many functional
requirements for this process to be successful:
Converting audio to textual information: This functional requirement will be met by using CMU
Sphinx libraries. This is accomplished by using the waveform received from the microphone and
splitting it into utterances by using fillers as delimiters. Then each utterance is interpreted using
a combination of the acoustic model, phonetic dictionary, and the language model.
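The splitting step above can be sketched as a simple energy-threshold segmentation: runs of near-silent samples act as utterance delimiters. This is only an approximation of what the Sphinx libraries do, and the threshold and run-length parameters are placeholders.

```cpp
#include <cstdlib>
#include <vector>

using Utterance = std::vector<short>;

// Splits raw PCM samples into utterances, treating any run of at least
// 'minSilenceRun' samples below 'silenceThreshold' as a delimiter.
std::vector<Utterance> SplitUtterances(const std::vector<short>& samples,
                                       short silenceThreshold,
                                       std::size_t minSilenceRun) {
    std::vector<Utterance> utterances;
    Utterance current;
    std::size_t silentRun = 0;
    for (short s : samples) {
        if (std::abs(s) < silenceThreshold) {
            ++silentRun;
            if (silentRun >= minSilenceRun) {
                // Delimiter reached: close the current utterance, if any.
                if (!current.empty()) utterances.push_back(current);
                current.clear();
                continue;
            }
        } else {
            silentRun = 0;
        }
        current.push_back(s);
    }
    if (!current.empty()) utterances.push_back(current);
    return utterances;
}
```

In the real pipeline each resulting utterance would then be decoded against the acoustic model, phonetic dictionary, and language model.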
Interpret sub words: VIRS will need to be very responsive to a user's verbal commands. In order
to meet this requirement, the VIRS software will need to be able to detect sub words and form
words and phrases from them as the input retrieves them. Luckily the CMU Sphinx libraries
contain methods that make this process trivial, and VIRS will take advantage of them.
Converting textual information to audio: VIRS handles this functional requirement by
implementing a wrapper around the public Festvox software created at Carnegie Mellon.
Adding in commands: This functional requirement relates to adding new custom commands for
the system to interpret. As an example, the system must be able to recognize the command
“move right,” after which an action must occur. This functionality will be handled by the VIRS
API.
Dynamically remove a command: There may be changing scenes or other instances when a
command is not to be interpreted (e.g. a command is outdated and not used). When this occurs
the software will need to remove it from the possibilities dynamically. This frees up memory
and speeds up the translating process. This will be a feature handled in the VIRS API.
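The add and remove behavior described in the two requirements above can be sketched with a small command registry. The names (CommandRegistry, Dispatch) and the string-keyed callback map are illustrative assumptions, not the final API.

```cpp
#include <functional>
#include <map>
#include <string>

// Maps recognized command text to the callback that performs the action.
class CommandRegistry {
public:
    // Registers (or replaces) a batch of command -> callback bindings.
    void LoadCommands(
        const std::map<std::string, std::function<void()>>& cmds) {
        for (const auto& kv : cmds) commands_[kv.first] = kv.second;
    }

    // Runs the callback if 'spoken' matches a known command;
    // returns false when the command is unknown (or was removed).
    bool Dispatch(const std::string& spoken) {
        auto it = commands_.find(spoken);
        if (it == commands_.end()) return false;
        it->second();
        return true;
    }

    // Dynamically drops a command so it is no longer interpretable.
    void RemoveCommand(const std::string& name) { commands_.erase(name); }

private:
    std::map<std::string, std::function<void()>> commands_;
};
```

Removing a command from the map also shrinks the search space the interpreter has to consider, which is the speed-up mentioned above.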
Retrieve current decisions/choices available in audio/text format at any time: VIRS needs to
be able to take in input at any time in an audio format. In the early development of the
software we may take in a button press to specify when input is being accepted, but in the long
run the software should know by separating silence from utterances. This can be accomplished
by using the CMU Sphinx libraries.
The commands sent to the system and responses by the system must be logged for diagnostic
purposes: All of the input commands and responses sent out from VIRS must be logged for
diagnostics, testing, and possibly to save states of interaction with the system. These logs will be
saved in an XML flat file with the commands and their time stamps.
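A minimal sketch of what that XML flat file could look like is shown below; the element and attribute names are our own guesses, not a finalized schema.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One logged interaction: the command heard, the response given,
// and when it happened (milliseconds since epoch, assumed unit).
struct LogEntry {
    std::string command;
    std::string response;
    long long timestampMs;
};

// Serializes the log into a simple XML document.
std::string ToXml(const std::vector<LogEntry>& entries) {
    std::ostringstream out;
    out << "<commandLog>\n";
    for (const auto& e : entries) {
        out << "  <entry timestampMs=\"" << e.timestampMs << "\">\n"
            << "    <command>" << e.command << "</command>\n"
            << "    <response>" << e.response << "</response>\n"
            << "  </entry>\n";
    }
    out << "</commandLog>\n";
    return out.str();
}
```

(A production version would also escape XML special characters in the command and response text.)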
System Analysis & Concept Designs
Figure 2. Sequence diagram of init()
Figure 3. LoadCommands sequence diagram
Figure 4. InterpretCommand sequence diagram
Figure 5. System block diagram for VIRS
Much of the design of the VIRS software was based on strategies and demonstrations given and
described at Appendix B-2.
Detailed Project Description
I/O Specification
Input: Input is received from a microphone headset and passed to VIRS. Additional input
comes from a handheld controller to enable a push-to-talk mode.
Output: Output is sent to a headset and/or the C6 speakers. Additional output will be sent to
the VirtuTrace logging system.
Interface Specification
VirtuTrace interfaces with VIRS by using the VIRS Speech API as specified below.
VIRS API
Init()
Initializes VIRS components starting the speech recognition
LoadCommands(Dictionary<Command,Function>)
Loads the commands and the functions to be called when a command is interpreted
RemoveCommands(list commands)
Removes the given list of commands from the list of recognizable commands.
InterpretCommand(RawAudioType audioFile)
Takes a raw audio file and returns the matching command.
CommandHistory(TimeSpan timeSpan)
Returns the history of which commands have been recognized by VIRS.
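The five calls above could take a C++ shape along the following lines. The type names (Command, Function, RawAudioType, TimeSpan) are placeholders pending final design, and the bodies are bookkeeping stubs, not real recognition.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

using Command = std::string;
using Function = std::function<void()>;
using RawAudioType = std::vector<short>;  // raw PCM samples (assumed)
using TimeSpan = long long;               // look-back window (assumed)

class Virs {
public:
    // Initializes VIRS components and starts speech recognition.
    void Init() { initialized_ = true; }

    // Registers commands and the callbacks run when each is interpreted.
    void LoadCommands(const std::map<Command, Function>& commands) {
        commands_.insert(commands.begin(), commands.end());
    }

    // Removes the given commands from the set of recognizable commands.
    void RemoveCommands(const std::vector<Command>& commands) {
        for (const auto& c : commands) commands_.erase(c);
    }

    // Stub: a real build would hand the audio to the Sphinx decoder.
    // Here we only record a recognized command so CommandHistory works.
    Command InterpretCommand(const RawAudioType& /*audioFile*/) {
        Command recognized =
            commands_.empty() ? "" : commands_.begin()->first;
        if (!recognized.empty()) history_.push_back(recognized);
        return recognized;
    }

    // Returns commands recognized so far (time filtering omitted here).
    std::vector<Command> CommandHistory(TimeSpan /*timeSpan*/) const {
        return history_;
    }

private:
    bool initialized_ = false;
    std::map<Command, Function> commands_;
    std::vector<Command> history_;
};
```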
NOTE: More will be added to this document as required throughout implementation
Hardware / Software Specification
A microphone with proper noise cancelling is required to pick up speech.
Speakers are required for outputting audio to the user.
The software will need to be able to run within the VR Juggler environment.
Implementation Issues & Challenges
So far multiple errors and issues have been encountered when installing VirtuTrace; it is
possible that additional errors will occur when testing begins, and a VIRS prototype is made.
Each team member is to work on a separate part of the prototype, so getting the different
chunks of code to mesh will be a challenge, as well. Being able to accurately interpret speech is
a huge concern and a very difficult problem. Everything will have to be done on approximation,
which is one limitation. Considering time constraints, outside libraries will have to be used, so
implementation is limited to the current functionalities of these libraries.
Testing, Procedures, & Specifications
Procedures: Unit tests must be written for each module before the modules are pushed to the
master branch of the remote repository. Each unit test created for individual methods will be
stored in the VIRS code base for future regression-type testing as new functionalities are added
to the VIRS software.
Specifications: Performance tests for the full speech-to-synthesis process must complete within
one second.
Testing: Module testing will be done and written by the developers working on the module.
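The one-second performance budget above could be checked in a test harness along these lines; RunFullPipeline is a placeholder for the real speech-to-synthesis path, so this sketches the measurement, not the pipeline itself.

```cpp
#include <chrono>

// Placeholder for the full speech -> text -> command -> audio pipeline.
void RunFullPipeline() { /* real work goes here */ }

// Times one full run and checks it against the one-second specification.
bool MeetsOneSecondBudget() {
    const auto start = std::chrono::steady_clock::now();
    RunFullPipeline();
    const auto elapsed = std::chrono::steady_clock::now() - start;
    return elapsed <= std::chrono::seconds(1);
}
```

Using steady_clock (rather than system_clock) avoids wall-clock adjustments skewing the measurement.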
Conclusion
VirtuTrace is software currently being used to simulate high stress situations in a virtual
environment and to study decision making in these situations. Within this program users must
gather information, and the current system utilizes a decision matrix where the user has to
point and click on the element containing the decision they want. This system is efficient, but it
is not very immersive.
This speech recognition system will take in user speech and interpret it into a series of
commands. The system works through a series of modules that handle specific tasks related to
correlating speech to a decision matrix. The system will allow the calling system to pass
audio and have a specified command take place. The calling system will also be able to obtain a
decision tree of the past command requests.
Appendix A (Outside APIs and Documents)
1. CMU Sphinx (PocketSphinx) API - http://cmusphinx.sourceforge.net/doc/pocketsphinx/
2. VirtuTrace Guidelines - https://bitbucket.org/godbyk/virtutrace/wiki/Coding_guidelines
3. Festvox API - http://festvox.org/docs/manual-1.4.3/festival_28.html
4. VR Juggler - http://vrjuggler.org/documentation.php
Appendix B (References)
1. DBagnell, M_Chi199, Daktari, and Admin. "Basic Concepts of Speech." CMUSphinx Wiki.
Carnegie Mellon University, 30 July 2012. Web. 14 Mar. 2014.
<http://cmusphinx.sourceforge.net/wiki/tutorialconcepts>.
2. Glass, James, and Victor Zue. "Automatic Speech Recognition." MIT OpenCourseWare.
Massachusetts Institute of Technology, 2003. Web. 14 Mar. 2014.
<http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-345-automatic-speech-recognition-spring-2003/>.