VIRS (VIRS IS RECOGNIZING SPEECH) DESIGN DOCUMENT
A SPEECH RECOGNITION PROJECT FOR VIRTUTRACE
James Boddie, Karl Akert, Geoffrey Miller, Colleen Riker, Michael Potter
CPRE / SE 491, LAST MODIFIED MARCH 13, 2014

Contents

Project Statement
Deliverables
Specifications
    Terminology
    Software Coding Standards
    Frameworks, IDEs, and Other Software Used
    Operating Environment
System Level Design
    System Requirements
    Functional Decomposition
    System Analysis & Concept Designs
Detail Project Description
    I/O Specification
    Interface Specification
VIRS API
    Init()
    LoadCommands(Dictionary<Command,Function>)
    RemoveCommands(list commands)
    InterpretCommand(RawAudioType audioFile)
    CommandHistory(TimeSpan timeSpan)
    Hardware / Software Specification
    Implementation Issues & Challenges
    Testing, Procedures, & Specifications
Conclusion
Appendix A (Outside APIs and Documents)
Appendix B (References)

Project Statement

Currently in the C-6 facility there is a program called VirtuTrace, where 3-dimensional situational programs are used to enhance training for members of the local fire departments. Within this program users must gather information, and the current system uses a pop-up decision matrix in which they point and click to make decisions. This system is efficient, but it is not very immersive for the user. Our goal is to implement a discrete voice command system (VIRS) that is robust enough to interpret natural language and quickly return the requested information to the user in an audio format.

Deliverables

Logger: There needs to be a system that records the history of commands executed by the operator and the responses given by the system.

Speech Recognition Interface: The speech recognition will be designed separately from the VirtuTrace project; therefore, it will need an easy API with callable methods. Developing it separately allows the software to adapt to changes and to be usable by other systems.

Speech-to-Text Translator: The speech recognition software will need to convert spoken words into a textual format easily interpreted by the software.

Command Interpreter: This deliverable is the feature that converts the textual information created by the translator into usable commands.

Decision Processor: This portion of the speech recognition software needs to be able to interpret a series of commands, either predetermined or new, and react back to the user in a meaningful way.
A meaningful way is an audio output response that gives the user choices, describes their situation, or updates the program based on the user's vocal command.

Specifications

Terminology

API (Application programming interface): An interface of software library methods and specifics on how methods interact with one another.

Phone: An unanalyzed sound in a language; similar classes of sounds. (Appendix B-1)

Diphone: An adjacent pair of phones; the part of the phones between two consecutive phones. (Appendix B-1)

Senone: Phones considered in context (these build sub-words such as syllables); also called a reduction-stable entity. When speech becomes fast, phones change but syllables stay the same. (Appendix B-1)

Filler: Non-linguistic sounds like breaths, coughs, "um," "uh," and background sounds. (Appendix B-1)

Utterance: A separate chunk of audio between pauses. Utterances are formed by using fillers as delimiters. (Appendix B-1)

Feature: Numbers calculated from speech by dividing the speech into frames. (Appendix B-1)

Feature Vector: An n-dimensional vector of features containing the numerical representation of the captured speech. (Appendix B-1)

Model: A mathematical object that gathers the common attributes of a spoken word. (Appendix B-1)

Acoustic Model: Acoustic properties for each senone; the most probable feature vectors for each phone. The acoustic model is built from senones with context. (Appendix B-1)

Phonetic Dictionary: Contains the mapping from words to phones. Usually not exhaustive (e.g. only two to three pronunciation variants are noted), but practical most of the time. This dictionary can be replaced by machine learning algorithms. (Appendix B-1)

Language Model: Used to restrict word searches. Defines which words could follow previously recognized words. Ideally this should be very good at guessing the next word based on a sequence of previous words.
(Appendix B-1)

Software Coding Standards

Throughout the speech recognition software there are many standards that must be adhered to.

VirtuTrace Coding Standards: All standards for working with the VirtuTrace project can be found in Appendix A-2.

CMU Sphinx: All standards and libraries for working with CMU Sphinx can be found in Appendix A-1.

Festvox: All standards and libraries for working with Festvox can be found in Appendix A-3.

Group Standards:
1. At the method level, comment on what the method does and specify its parameters.
2. Each method created must have a unit test to check its functionality.
3. General comments must explain why an algorithm is written a certain way and what it does.

Frameworks, IDEs, and Other Software Used

Ubuntu: The operating system that VirtuTrace and the text editors run in.
C++: The programming language for writing the speech recognition software.
BitBucket (Git): The hosting site for the Git repository.
Vim: Text editor running in Linux.
Gedit: Text editor running in Linux.
Sublime: Text editor running in Linux.
VirtuTrace: Software that runs virtual environments used to study decision making in stressful situations.
Sphinx: Carnegie Mellon University's speech recognition code base and suite of tools.
Festvox: Voice synthesizing software.

Figure 1

Operating Environment

As described in the section above, the VIRS software will only compile and run under Linux operating systems (specifically Ubuntu). The supplied API will be in C++ only. It will be designed to be independent of the VirtuTrace software, but since it is initially for VirtuTrace, many of the methods may only work in that environment.

System Level Design

System Requirements

VirtuTrace can only run on a Linux operating system; Ubuntu is the easiest distribution to use. A computer with a text editor is needed for writing code.
Functional Decomposition

The main purpose of the VIRS speech recognition software is to take audio input from a microphone, interpret it, and respond to the user. The software has many functional requirements for this process to be successful:

Converting audio to textual information: This functional requirement will be met by using the CMU Sphinx libraries. The waveform received from the microphone is split into utterances by using fillers as delimiters. Each utterance is then interpreted using a combination of the acoustic model, the phonetic dictionary, and the language model.

Interpreting sub-words: VIRS will need to be very responsive to a user's verbal commands. To meet this requirement, the VIRS software will need to detect sub-words and form words and phrases from them as the input arrives. Fortunately, the CMU Sphinx libraries contain methods that make this process trivial, and VIRS will take advantage of them.

Converting textual information to audio: VIRS handles this functional requirement by implementing a wrapper around the public Festvox software created at Carnegie Mellon.

Adding commands: This functional requirement relates to adding new custom commands for the system to interpret. For example, the system must be able to recognize the command "move right," and then an action must occur. This functionality will be handled by the VIRS API.

Dynamically removing a command: There may be changing scenes or other instances when a command should not be interpreted (e.g. a command is outdated and no longer used). When this occurs the software will need to remove it from the possibilities dynamically. This frees up memory and speeds up the translation process. This feature will be handled in the VIRS API.

Retrieving the current decisions/choices available in audio/text format at any time: VIRS needs to be able to take in input at any time in an audio format.
In the early development of the software we may use a button press to specify when input is being accepted, but in the long run the software should know by separating silence from utterances. This can be accomplished using the CMU Sphinx libraries.

The commands sent to the system and the responses given by the system must be logged for diagnostic purposes: All of the input commands and responses sent out from VIRS must be logged for diagnostics, testing, and possibly saving states of interaction with the system. These logs will be saved in an XML flat file containing the commands and their time stamps.

System Analysis & Concept Designs

Figure 2. Sequence diagram of init()
Figure 3. LoadCommands sequence diagram
Figure 4. InterpretCommand sequence diagram
Figure 5. System block diagram for VIRS

Much of the design of the VIRS software was based on strategies and demonstrations given and described in Appendix B-2.

Detail Project Description

I/O Specification

Input: Input is received from a microphone headset and passed to VIRS. Additional input comes from a handheld controller to enable a push-to-talk mode.

Output: Output is sent to a headset and/or the C6 speakers. Additional output will be sent to the VirtuTrace logging system.

Interface Specification

VirtuTrace interfaces with VIRS by using the VIRS Speech API as specified below.

VIRS API

Init(): Initializes VIRS components and starts the speech recognition.

LoadCommands(Dictionary<Command,Function>): Loads the commands and the functions to be called when a command is interpreted.

RemoveCommands(list commands): Removes the given list of commands from the list of recognizable commands.

InterpretCommand(RawAudioType audioFile): Takes a raw audio file and returns the matching command.

CommandHistory(TimeSpan timeSpan): Returns the history of which commands have been recognized by VIRS.
NOTE: More will be added to this document as required throughout implementation.

Hardware / Software Specification

A microphone with proper noise-cancelling is required so that it will pick up speech. Speakers are required for outputting audio to the user. The software will need to run within the VR Juggler environment.

Implementation Issues & Challenges

So far multiple errors and issues have been encountered when installing VirtuTrace; it is possible that additional errors will occur when testing begins and a VIRS prototype is made. Each team member is to work on a separate part of the prototype, so getting the different chunks of code to mesh will be a challenge as well. Being able to accurately interpret speech is a huge concern and a very difficult problem. Everything will have to be done by approximation, which is one limitation. Considering time constraints, outside libraries will have to be used, so the implementation is limited to the current functionality of those libraries.

Testing, Procedures, & Specifications

Procedures: Unit tests must be written for each module before the modules are pushed to the master branch of the remote repository. Each unit test created for individual methods will be stored in the VIRS code base for future regression-type testing as new functionality is added to the VIRS software.

Specifications: Performance tests require that the full process, from speech input to synthesized response, complete within at most one second.

Testing: Module testing will be done and written by the developers working on the module.

Conclusion

VirtuTrace is software currently being used to simulate high-stress situations in a virtual environment and to study decision making in these situations. Within this program users must gather information, and the current system utilizes a decision matrix where the user has to point and click on the element containing the decision they want. This system is efficient, but it is not very immersive.
This speech recognition system will take in user speech and interpret it into a series of commands. The system works through a series of modules that handle specific tasks related to correlating speech to a decision matrix. The system will allow the calling system to pass audio and have a specified command take place. The calling system will also be able to obtain a decision tree of past command requests.

Appendix A (Outside APIs and Documents)

1. CMU Sphinx (PocketSphinx) API - http://cmusphinx.sourceforge.net/doc/pocketsphinx/
2. VirtuTrace Guidelines - https://bitbucket.org/godbyk/virtutrace/wiki/Coding_guidelines
3. Festvox API - http://festvox.org/docs/manual-1.4.3/festival_28.html
4. VR Juggler - http://vrjuggler.org/documentation.php

Appendix B (References)

1. DBagnell, M_Chi199, Daktari, and Admin. "Basic Concepts of Speech." CMUSphinx Wiki. Carnegie Mellon University, 30 July 2012. Web. 14 Mar. 2014. <http://cmusphinx.sourceforge.net/wiki/tutorialconcepts>.
2. Glass, James, and Victor Zue. "Automatic Speech Recognition." MIT OpenCourseWare. Massachusetts Institute of Technology, 2003. Web. 14 Mar. 2014. <http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-345-automatic-speech-recognition-spring-2003/>.