Florida Tech
Speech Processing Final
Phoning Home
ECE 5525: Dr. Kepuska
Sean Powers
12/7/2010

Table of Contents
Problem Statement
How it works
System Architecture
Speech Recognition Engine
Demonstration
Future Works
Appendix

Problem Statement

The study of speech processing and recognition has led to many useful applications because speech is such a natural way to communicate. We, as people, are trained from the beginning of our lives to communicate with speech. As technology advances, it is no surprise that speech will be used to simplify our lives in a variety of applications.

Imagine you leave your house in a hurry because you are running late for work, have to pick up your kids from school, or for any other reason, and you forget to turn off the stove. You remember three blocks away and do not have time to turn around. You dial a phone number that is assigned to your house and ask your house to turn the stove off for you.
Your house confirms that the stove will be turned off, and you are relieved that you won’t come home to a potential fire. This is the essence of the Phoning Home system.

Phoning Home allows you to make a phone call to your house and control your devices and appliances. This could be useful for controlling the temperature of your air conditioner, turning lights on and off, opening and closing your garage door, etc. Phoning Home is therefore not only a convenient way to control your household appliances but can also be used to manage your energy use more efficiently.

Figure 1: Phoning Home System

How it works

Phoning Home is broken into four main components. The Voice over IP (VOIP) server forwards the phone speech and uses text-to-speech (TTS) to speak to the user. The Phoning Home web services handle the communication for all Phoning Home households. The speech recognition server is responsible for recognizing the user’s phone speech. The client services handle an individual household’s devices.

Figure 2: System Overview (components shown: Asterisk Server (VOIP), Phoning Home Services (WCF), Speech Recognition (CMU Sphinx), Client Services (WCF + MCU))

System Architecture

Each device that can be controlled via the Phoning Home system is known as an AWARE (A Wireless Appliance Reducing Energy) device. Every device is controlled wirelessly by the Phoning Home Master Control, an Atmel ATmega16 microcontroller connected via USB to the Client Services.

Figure 3: Phoning Home Concept

The communication between the components is described in more detail in the following sequence diagram:

Figure 4: Phoning Home System Sequence Diagram

Speech Recognition Engine

Phoning Home uses the Carnegie Mellon University (CMU) Sphinx speech recognition engine, an open source engine that is highly configurable. The Sphinx-4 framework consists of three primary modules: the FrontEnd, the Decoder, and the Linguist.
The FrontEnd takes input signals and parameterizes them into a sequence of features. The Linguist translates any type of standard language model, along with pronunciation information from the Dictionary and structural information from one or more sets of acoustic models, into a SearchGraph. The Decoder uses the features from the FrontEnd and the SearchGraph from the Linguist to perform the actual decoding and produce the Results.

Figure 5: Sphinx-4 framework. The main blocks are the FrontEnd, the Decoder, and the Linguist. Supporting blocks include the ConfigurationManager and the Tools block. Source: CMU Sphinx-4, 2010

Each Sphinx-4 module comes with several interchangeable implementations. The FrontEnd implementation supports MFCC, PLPC, and LPC feature extraction. The Linguist implementation supports a variety of language models, including CFGs, FSTs, and N-Grams. The Decoder supports a variety of SearchManager implementations, including but not limited to Viterbi, Bushderby, and parallel searches. With Sphinx-4’s ConfigurationManager, different combinations of these implementations can be tested easily. This flexibility makes Sphinx-4 a strong candidate for the speech recognition engine module in the Phoning Home system.

The source code to Sphinx-4 is freely available under a BSD-style license. The license permits others to do academic and commercial research and to develop products and applications without requiring any licensing fees.

Demonstration

For the demonstration of Phoning Home, I used Tropo, a commercially available implementation of the Asterisk VOIP server mentioned above. Tropo provides a RESTful web API for connecting to its cloud-based services.

Figure 6: Tropo Web Services

One issue I had to overcome when developing the demonstration was recognizing telephone speech. There are significant differences between microphone and telephone speech.
From the Sphinx documentation:

“The issue with telephone audio is that it has limited range of frequencies. Unlike usual microphone recording that includes frequencies from 1 Hz to 8000 [Hz], telephone audio is passed through frequency filters. As a result telephone audio contains frequencies from 200 Hz to 3500 Hz. That makes it impossible to recognize telephone audio with usual microphone acoustic model. You need to use specialized models to recognize it.”

I therefore needed an acoustic model capable of recognizing telephone speech. I ended up using the VoxForge English model, as I found it to be the best performing and most efficient model for telephone-quality speech. The VoxForge acoustic model was trained using Linear Discriminant Analysis (LDA) transforms and required adjustment of the melfilter parameters: the filter bank had to be limited to the telephone speech bandwidth (200 Hz to 3500 Hz). Because Sphinx-4 is so modular and flexible, it was simple to adjust the front end parameters through the Sphinx-4 configuration file, listed in the Appendix.

Future Works

Although Phoning Home is designed to let you call your house to control your appliances, it would be very useful to incorporate Dr. Kepuska’s Wake-up-Word (WuW) technology so that you can control your appliances from inside the house as well.

In a commercial product, Phoning Home would need security measures to confirm that the caller actually has the credentials to control the appliances. This could be as simple as a password or as complex as speaker recognition to confirm that the caller is a permitted user.
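As a rough illustration of the simplest option mentioned above, the following Python sketch checks a spoken numeric password. It assumes the recognizer returns digit words from the same vocabulary as the <numbers> rule of the grammar in the Appendix. The function names, the salted PBKDF2 hashing, and the iteration count are illustrative assumptions and are not part of the Phoning Home implementation.

```python
import hashlib
import hmac

# Digit vocabulary taken from the <numbers> rule of the Phoning Home grammar.
DIGIT_WORDS = {
    "oh": "0", "zero": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}


def words_to_pin(utterance: str) -> str:
    """Convert recognized digit words (e.g. "four oh seven") to "407".

    Raises ValueError if any word is outside the digit vocabulary.
    """
    digits = []
    for word in utterance.lower().split():
        if word not in DIGIT_WORDS:
            raise ValueError(f"not a digit word: {word!r}")
        digits.append(DIGIT_WORDS[word])
    return "".join(digits)


def hash_pin(pin: str, salt: bytes) -> bytes:
    """Derive a salted hash so the PIN itself is never stored (assumed scheme)."""
    return hashlib.pbkdf2_hmac("sha256", pin.encode("utf-8"), salt, 100_000)


def verify_spoken_pin(utterance: str, salt: bytes, stored_hash: bytes) -> bool:
    """Return True only if the spoken digits hash to the stored value."""
    try:
        pin = words_to_pin(utterance)
    except ValueError:
        return False
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(hash_pin(pin, salt), stored_hash)
```

In such a design, the household would store only the salt and the derived hash, and the recognized text of the caller's spoken PIN would be passed to verify_spoken_pin before any device command is accepted.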
Appendix

The grammar used by Phoning Home follows the Java Speech Grammar Format (JSGF) standard and is listed below:

#JSGF V1.0;

/**
 * JSGF Digits Grammar for Phoning Home
 */

grammar digits;

public <command> = <polite> <startAction> room [number] <numbers> <devices> <endAction>;

<polite> = [please | kindly | could you | oh mighty computer | operator];
<startAction> = (turn | switch);
<devices> = (lights | lamps);
<endAction> = (on | off);
<numbers> = (oh | zero | one | two | three | four | five | six | seven | eight | nine);

The Sphinx 4 configuration file is listed below:

<?xml version="1.0" encoding="UTF-8"?>

<!-- Sphinx-4 Configuration file -->

<!-- ******************************************************** -->
<!-- an4 configuration file -->
<!-- ******************************************************** -->

<config>

    <!-- ******************************************************** -->
    <!-- frequently tuned properties -->
    <!-- ******************************************************** -->

    <property name="logLevel" value="WARNING"/>

    <property name="absoluteBeamWidth" value="-1"/>
    <property name="relativeBeamWidth" value="1E-80"/>
    <property name="wordInsertionProbability" value="1E-36"/>
    <property name="languageWeight" value="8"/>

    <property name="frontend" value="epFrontEnd"/>
    <property name="recognizer" value="recognizer"/>
    <property name="showCreations" value="false"/>

    <!-- ******************************************************** -->
    <!-- word recognizer configuration -->
    <!-- ******************************************************** -->

    <component name="recognizer" type="edu.cmu.sphinx.recognizer.Recognizer">
        <property name="decoder" value="decoder"/>
        <propertylist name="monitors">
            <item>accuracyTracker </item>
            <item>speedTracker </item>
            <item>memoryTracker </item>
        </propertylist>
    </component>

    <!-- ******************************************************** -->
    <!-- The Decoder configuration -->
    <!-- ******************************************************** -->

    <component name="decoder" type="edu.cmu.sphinx.decoder.Decoder">
        <property name="searchManager" value="searchManager"/>
    </component>

    <component name="searchManager"
        type="edu.cmu.sphinx.decoder.search.SimpleBreadthFirstSearchManager">
        <property name="logMath" value="logMath"/>
        <property name="linguist" value="flatLinguist"/>
        <!--<property name="linguist" value="lexTreeLinguist"/>-->
        <property name="pruner" value="trivialPruner"/>
        <property name="scorer" value="threadedScorer"/>
        <property name="activeListFactory" value="activeList"/>
    </component>

    <component name="activeList"
        type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
        <property name="logMath" value="logMath"/>
        <property name="absoluteBeamWidth" value="${absoluteBeamWidth}"/>
        <property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
    </component>

    <component name="trivialPruner"
        type="edu.cmu.sphinx.decoder.pruner.SimplePruner"/>

    <component name="threadedScorer"
        type="edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer">
        <property name="frontend" value="${frontend}"/>
    </component>

    <!-- ******************************************************** -->
    <!-- The linguist configuration -->
    <!-- ******************************************************** -->

    <component name="flatLinguist"
        type="edu.cmu.sphinx.linguist.flat.FlatLinguist">
        <property name="logMath" value="logMath"/>
        <property name="grammar" value="jsgfGrammar"/>
        <property name="acousticModel" value="wsj"/>
        <property name="wordInsertionProbability" value="${wordInsertionProbability}"/>
        <property name="languageWeight" value="${languageWeight}"/>
        <property name="unitManager" value="unitManager"/>
    </component>

    <!-- ******************************************************** -->
    <!-- The linguist configuration -->
    <!-- ******************************************************** -->

    <component name="lexTreeLinguist"
        type="edu.cmu.sphinx.linguist.lextree.LexTreeLinguist">
        <property name="logMath" value="logMath"/>
        <property name="acousticModel" value="wsj"/>
        <property name="languageModel" value="trigramModel"/>
        <property name="dictionary" value="dictionary"/>
        <property name="addFillerWords" value="false"/>
        <property name="fillerInsertionProbability" value="1E-10"/>
        <property name="generateUnitStates" value="false"/>
        <property name="wantUnigramSmear" value="true"/>
        <property name="unigramSmearWeight" value="1"/>
        <property name="wordInsertionProbability" value="1E-16"/>
        <property name="silenceInsertionProbability" value=".1"/>
        <property name="languageWeight" value="8.0"/>
        <property name="unitManager" value="unitManager"/>
    </component>

    <!-- ******************************************************** -->
    <!-- The Grammar configuration -->
    <!-- ******************************************************** -->

    <component name="jsgfGrammar" type="edu.cmu.sphinx.jsgf.JSGFGrammar">
        <property name="dictionary" value="dictionary"/>
        <property name="grammarLocation" value="resource:/com/edu/phoninghome/"/>
        <property name="grammarName" value="digits"/>
        <property name="logMath" value="logMath"/>
    </component>

    <!-- ******************************************************** -->
    <!-- The Dictionary configuration -->
    <!-- ******************************************************** -->

    <component name="dictionary"
        type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
        <property name="dictionaryPath" value="resource:/voxforge/etc/cmudict.0.7a"/>
        <property name="fillerPath" value="resource:/voxforge/etc/voxforge_en_sphinx.filler"/>
        <property name="addSilEndingPronunciation" value="false"/>
        <property name="wordReplacement" value="&lt;sil&gt;"/>
        <property name="unitManager" value="unitManager"/>
    </component>

    <!-- ******************************************************** -->
    <!-- The acoustic model configuration -->
    <!-- ******************************************************** -->

    <component name="wsj"
        type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel">
        <property name="loader" value="wsjLoader"/>
        <property name="unitManager" value="unitManager"/>
    </component>

    <component name="wsjLoader"
        type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
        <property name="logMath" value="logMath"/>
        <property name="unitManager" value="unitManager"/>
        <property name="location" value="resource:/voxforge"/>
        <property name="modelDefinition" value="model_parameters/voxforge_en_sphinx.cd_cont_3000/mdef"/>
        <property name="dataLocation" value="model_parameters/voxforge_en_sphinx.cd_cont_3000/"/>
    </component>

    <!-- ******************************************************** -->
    <!-- The Language Model configuration -->
    <!-- ******************************************************** -->

    <component name="trigramModel"
        type="edu.cmu.sphinx.linguist.language.ngram.SimpleNGramModel">
        <property name="location" value="resource:/voxforge/etc/voxforge_en_sphinx.lm"/>
        <property name="logMath" value="logMath"/>
        <property name="dictionary" value="dictionary"/>
        <property name="maxDepth" value="3"/>
        <property name="unigramWeight" value=".7"/>
    </component>

    <!-- ******************************************************** -->
    <!-- The unit manager configuration -->
    <!-- ******************************************************** -->

    <component name="unitManager"
        type="edu.cmu.sphinx.linguist.acoustic.UnitManager"/>

    <!-- ******************************************************** -->
    <!-- The live frontend configuration -->
    <!-- ******************************************************** -->

    <component name="epFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
        <propertylist name="pipeline">
            <item>audioFileDataSource </item>
            <item>dataBlocker </item>
            <item>speechClassifier </item>
            <item>speechMarker </item>
            <item>nonSpeechDataFilter </item>
            <item>preemphasizer </item>
            <item>windower </item>
            <item>fft </item>
            <item>melFilterBank </item>
            <item>dct </item>
            <item>liveCMN </item>
            <item>featureExtraction </item>
            <item>lda </item>
        </propertylist>
    </component>

    <!-- ******************************************************** -->
    <!-- The frontend pipelines -->
    <!-- ******************************************************** -->

    <component name="audioFileDataSource"
        type="edu.cmu.sphinx.frontend.util.AudioFileDataSource"/>

    <component name="dataBlocker"
        type="edu.cmu.sphinx.frontend.DataBlocker"/>

    <component name="speechClassifier"
        type="edu.cmu.sphinx.frontend.endpoint.SpeechClassifier"/>

    <component name="nonSpeechDataFilter"
        type="edu.cmu.sphinx.frontend.endpoint.NonSpeechDataFilter"/>

    <component name="speechMarker"
        type="edu.cmu.sphinx.frontend.endpoint.SpeechMarker" />

    <component name="preemphasizer"
        type="edu.cmu.sphinx.frontend.filter.Preemphasizer"/>

    <component name="windower"
        type="edu.cmu.sphinx.frontend.window.RaisedCosineWindower">
    </component>

    <component name="fft"
        type="edu.cmu.sphinx.frontend.transform.DiscreteFourierTransform">
    </component>

    <component name="melFilterBank"
        type="edu.cmu.sphinx.frontend.frequencywarp.MelFrequencyFilterBank">
        <property name="numberFilters" value="31"/>
        <property name="minimumFrequency" value="200"/>
        <property name="maximumFrequency" value="3500"/>
    </component>

    <component name="dct"
        type="edu.cmu.sphinx.frontend.transform.DiscreteCosineTransform"/>

    <component name="liveCMN"
        type="edu.cmu.sphinx.frontend.feature.LiveCMN"/>

    <component name="featureExtraction"
        type="edu.cmu.sphinx.frontend.feature.DeltasFeatureExtractor" />

    <!--
    <component name="featureExtraction"
        type="edu.cmu.sphinx.frontend.feature.ConcatFeatureExtractor">
        <property name="windowSize" value="3"/>
    </component>
    -->

    <component name="lda"
        type="edu.cmu.sphinx.frontend.feature.FeatureTransform">
        <property name="loader" value="wsjLoader"/>
    </component>

    <!-- ******************************************************* -->
    <!-- monitors -->
    <!-- ******************************************************* -->

    <component name="accuracyTracker"
        type="edu.cmu.sphinx.instrumentation.BestPathAccuracyTracker">
        <property name="recognizer" value="${recognizer}"/>
        <property name="showAlignedResults" value="false"/>
        <property name="showRawResults" value="false"/>
    </component>

    <component name="memoryTracker"
        type="edu.cmu.sphinx.instrumentation.MemoryTracker">
        <property name="recognizer" value="${recognizer}"/>
        <property name="showSummary" value="false"/>
        <property name="showDetails" value="false"/>
    </component>

    <component name="speedTracker"
        type="edu.cmu.sphinx.instrumentation.SpeedTracker">
        <property name="recognizer" value="${recognizer}"/>
        <property name="frontend" value="${frontend}"/>
        <property name="showSummary" value="true"/>
        <property name="showDetails" value="false"/>
    </component>

    <!-- ******************************************************* -->
    <!-- Miscellaneous components -->
    <!-- ******************************************************* -->

    <component name="logMath" type="edu.cmu.sphinx.util.LogMath">
        <property name="logBase" value="1.0001"/>
        <property name="useAddTable" value="true"/>
    </component>

</config>
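
The melFilterBank settings in the configuration above (31 filters between 200 Hz and 3500 Hz) can be illustrated with a short Python sketch. It assumes the commonly used mel formula, mel = 2595 log10(1 + f/700), and filters spaced evenly on the mel scale. The function names are hypothetical, and this is not Sphinx-4's own code, whose exact filter construction may differ.

```python
import math


def hz_to_mel(hz: float) -> float:
    """Convert a frequency in Hz to mels (common 2595*log10 formula, assumed)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_centers(min_hz: float, max_hz: float, num_filters: int) -> list:
    """Return center frequencies (Hz) of filters spaced evenly in mels.

    num_filters triangular filters need num_filters + 2 evenly spaced mel
    points: the two band edges plus one center per filter.
    """
    low, high = hz_to_mel(min_hz), hz_to_mel(max_hz)
    step = (high - low) / (num_filters + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(num_filters)]


if __name__ == "__main__":
    # Values taken from the melFilterBank component of the configuration.
    for index, center in enumerate(filter_centers(200.0, 3500.0, 31), start=1):
        print(f"filter {index:2d}: {center:7.1f} Hz")
```

Every center falls inside the telephone band, so none of the filters is spent on frequencies that the telephone channel has already removed. The spacing between centers grows with frequency, which is the usual behavior of the mel scale.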