Florida Tech
Speech Processing Final
Phoning Home
ECE 5525: Dr. Kepuska
Sean Powers
12/7/2010
Table of Contents
Problem Statement
How it works
System Architecture
Speech Recognition Engine
Demonstration
Future Works
Appendix
Problem Statement
The study of speech processing and recognition has led to many useful applications due to the
natural feel of speech communication. We, as people, are trained from the beginning of our lives to
communicate with speech. As technology advances it is no surprise that speech will be used as a means
of simplifying our lives in various applications.
Imagine leaving your house in a hurry, because you are running late for work or have to pick up
your kids from school, and forgetting to turn off the stove. You remember three blocks away, with
no time to turn around. You dial a phone number that is assigned to your house and ask your house
to turn the stove off for you. Your house confirms that the stove will be turned off, and you are
relieved that you will not come home to a potential fire. This is the essence of the Phoning Home
system.
Phoning Home allows you to make a phone call to your house and control your devices and
appliances. This could be useful for setting the temperature of your air conditioner, turning
lights on and off, opening and closing your garage door, etc. Phoning Home is therefore not only a
convenient way to control your household appliances but also a way to manage your energy use
more efficiently.
Figure 1: Phoning Home System
How it works
Phoning Home is broken into four main components: the Voice over IP (VOIP) server, which
forwards the phone speech and uses text to speech (TTS) to speak to the user; the Phoning Home web
services, which handle the communication for all Phoning Home households; the speech recognition
server, which is responsible for recognizing the user’s phone speech; and the client services, which
handle an individual household’s devices.
Figure 2: System Overview. Blocks: Asterisk Server (VOIP), Phoning Home Services (WCF), Speech
Recognition (CMU Sphinx), Client Services (WCF + MCU).
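The division of labor between the four components can be sketched in a few lines of Python. This is only an illustration of the roles described above, not the WCF implementation: the class names, the `handle_call` method, the phone number, and the `fake_recognizer` placeholder are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ClientServices:
    """Stands in for one household's client services (hypothetical)."""
    devices: dict = field(default_factory=dict)  # (room, device) -> state

    def apply(self, room: int, device: str, state: str) -> str:
        # Record the new device state and return text for the TTS reply.
        self.devices[(room, device)] = state
        return f"Room {room} {device} turned {state}"


@dataclass
class PhoningHomeServices:
    """Stands in for the central services that route calls to households."""
    households: dict = field(default_factory=dict)  # phone number -> client

    def handle_call(self, number: str, audio: bytes, recognize) -> str:
        client = self.households[number]           # pick the household
        room, device, state = recognize(audio)     # speech recognition server
        return client.apply(room, device, state)   # reply spoken by the VOIP server


def fake_recognizer(audio: bytes):
    """Placeholder for the recognition server; always returns one command."""
    return 1, "lights", "off"


services = PhoningHomeServices({"555-0100": ClientServices()})
reply = services.handle_call("555-0100", b"...", fake_recognizer)
print(reply)
```

The point of the sketch is that the central services only route: recognition and device control stay behind their own interfaces, so either can be replaced independently.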
System Architecture
Each device that can be controlled via the Phoning Home system is known as A Wireless
Appliance Reducing Energy (AWARE). Every device is controlled wirelessly by the Phoning Home Master
Control, an Atmel ATmega16 microcontroller connected via USB to the Client Services.
Figure 3: Phoning Home Concept
The communication between each component is described in more detail in the following sequence diagram:
Figure 4: Phoning Home System Sequence Diagram
Speech Recognition Engine
The Phoning Home system uses the Carnegie Mellon University (CMU) Sphinx speech recognition
engine, an open source engine that is highly configurable.
The Sphinx-4 framework consists of three primary modules: the FrontEnd, the Decoder, and the
Linguist. The FrontEnd takes input signals and parameterizes them into a sequence of features. The
Linguist translates any type of standard language model along with information from the Dictionary and
structural information from one or more sets of Acoustic models into a SearchGraph. The Decoder uses
the features from the FrontEnd, and the SearchGraph from the Linguist to perform the actual decoding
and produce the Results.
Figure 5: Sphinx-4 framework. The main blocks are the FrontEnd, the Decoder, and the Linguist.
Supporting blocks include the ConfigurationManager and the tools block. Source: CMU Sphinx-4, 2010
The Sphinx-4 framework is modular and flexible. The FrontEnd implementation supports
MFCC, PLPC, and LPC feature extraction. The Linguist implementation supports a variety of language
models, including CFGs, FSTs, and N-Grams. The Decoder supports a variety of SearchManager
implementations, including but not limited to Viterbi, Bushderby, and parallel searches. With
Sphinx-4’s Configuration Manager, various combinations of these implementations can be tested
easily. These properties make Sphinx-4 an ideal candidate for the speech recognition engine module
in the Phoning Home system.
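To make the Viterbi search mentioned above concrete, here is a minimal log-domain Viterbi sketch over a toy two-state model. It is not Sphinx-4 code: the states, probabilities, and observation labels are invented for illustration, and a real decoder searches a far larger graph with pruning.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence and its log probability."""
    # best[s] holds (log score of the best path ending in s, that path).
    best = {s: (log_start[s] + log_emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Choose the best predecessor state for s.
            score, path = max(
                (best[p][0] + log_trans[p][s], best[p][1]) for p in states
            )
            step[s] = (score + log_emit[s][obs], path + [s])
        best = step
    score, path = max(best.values())
    return path, score


def logs(table):
    """Convert a table of probabilities (possibly nested) to natural logs."""
    return {k: logs(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


# Toy model (assumed values): silence vs. speech, observed as low/high energy.
states = ("sil", "speech")
start = logs({"sil": 0.8, "speech": 0.2})
trans = logs({"sil": {"sil": 0.7, "speech": 0.3},
              "speech": {"sil": 0.2, "speech": 0.8}})
emit = logs({"sil": {"low": 0.9, "high": 0.1},
             "speech": {"low": 0.2, "high": 0.8}})

path, score = viterbi(["low", "high", "high", "low"], states, start, trans, emit)
print(path, math.exp(score))
```

Working in log probabilities turns products into sums, which avoids numeric underflow on long utterances.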
The source code to Sphinx-4 is freely available under a BSD-style license. The license permits
others to do academic and commercial research and to develop products and applications without
requiring any licensing fees.
Demonstration
For the demonstration of Phoning Home, I used Tropo, a commercially available implementation
of the Asterisk VOIP server mentioned above. Tropo provides a RESTful web API for connecting to
its cloud-based services.
Figure 6: Tropo Web Services
One issue I had to overcome when developing the demonstration was recognizing telephone
speech. There are significant differences between microphone and telephone speech. From the Sphinx
documentation:
“The issue with telephone audio is that it has limited range of frequencies. Unlike usual
microphone recording that includes frequencies from 1 Hz to 8000 kHz, telephone audio is passed
through frequency filters. As a result telephone audio contains frequencies from 200 Hz to 3500 Hz. That
makes it impossible to recognize telephone audio with usual microphone acoustic model. You need to
use specialized models to recognize it.”
I therefore needed an acoustic model capable of recognizing telephone speech. I ended up using
the VoxForge English model, as I found it to be the best performing and most efficient model for
telephone-quality speech. The VoxForge acoustic model was trained using Linear Discriminant
Analysis (LDA) transforms and required adjusting the mel filter parameters to the limits of the
telephone speech bandwidth. Because Sphinx-4 is so modular and flexible, it was simple to adjust the
front end parameters through the Sphinx-4 configuration file, listed in the Appendix.
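The effect of restricting the mel filter bank to the telephone band can be sketched as follows. The example spaces 31 filter centers evenly on the mel scale between 200 Hz and 3500 Hz, the values used in the configuration file. It assumes the common formula mel = 2595 log10(1 + f/700); the helper names are hypothetical, and the real Sphinx-4 filter bank may place filter edges differently (for example by aligning them to FFT bins).

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to the mel scale (common 2595/700 formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_centers(num_filters, min_hz, max_hz):
    """Center frequencies of filters spaced evenly on the mel scale."""
    low, high = hz_to_mel(min_hz), hz_to_mel(max_hz)
    step = (high - low) / (num_filters + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(num_filters)]


# numberFilters, minimumFrequency, maximumFrequency from the configuration.
centers = filter_centers(31, 200.0, 3500.0)
print(f"{len(centers)} filters, first {centers[0]:.1f} Hz, last {centers[-1]:.1f} Hz")
```

Every filter now falls inside the band that survives the telephone channel, so no filter output is computed from frequencies the channel has removed.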
Future Works
Although Phoning Home is designed to let you call your house to control your appliances, it
would be very useful to integrate Dr. Kepuska’s Wake-up-Word (WuW) technology so that you can
control your appliances from inside the house as well.
In a commercial product, Phoning Home would need to be equipped with extensive security
measures to confirm the user calling their house actually has appropriate credentials to control their
appliances. This could be as simple as a password or as complex as adding speaker recognition to
confirm the user calling is a user that is permitted to call.
Appendix
The grammar used by Phoning Home follows the Java Speech Grammar Format (JSGF) standards
and is listed below:
#JSGF V1.0;
/**
* JSGF Digits Grammar for Phoning Home
*/
grammar digits;
public <command> = <polite> <startAction> room [number] <numbers> <devices> <endAction>;
<polite> = [please | kindly | could you | oh mighty computer | operator];
<startAction> = (turn | switch);
<devices> = (lights | lamps);
<endAction> = (on | off);
<numbers> = (oh | zero | one | two | three | four | five | six | seven | eight | nine);
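To show which sentences this grammar accepts, the rules can be mirrored in a small Python matcher. This is an illustrative stand-in only: Phoning Home uses the Sphinx-4 JSGF grammar itself, and the `parse_command` function and its return format are hypothetical.

```python
import re

# Spoken digit words from the <numbers> rule mapped to integers.
NUMBERS = {"oh": 0, "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
           "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}

# One regular expression mirroring the public <command> rule.
COMMAND = re.compile(
    r"^(?:(?:please|kindly|could you|oh mighty computer|operator) )?"  # <polite>
    r"(?:turn|switch) room (?:number )?"              # <startAction> room [number]
    r"(?P<number>" + "|".join(NUMBERS) + r") "        # <numbers>
    r"(?P<device>lights|lamps) "                      # <devices>
    r"(?P<action>on|off)$"                            # <endAction>
)


def parse_command(text):
    """Return (room, device, action), or None if the text is not in the grammar."""
    match = COMMAND.match(" ".join(text.lower().split()))
    if match is None:
        return None
    return NUMBERS[match["number"]], match["device"], match["action"]


print(parse_command("please turn room number three lights off"))
print(parse_command("turn the lights off"))
```

Note that `number` inside the square brackets is an optional literal word, so both "room three" and "room number three" are accepted.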
The Sphinx 4 configuration file is listed below:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sphinx-4 Configuration file -->
<!-- ******************************************************** -->
<!-- an4 configuration file
-->
<!-- ******************************************************** -->
<config>
<!-- ******************************************************** -->
<!-- frequently tuned properties
-->
<!-- ******************************************************** -->
<property name="logLevel" value="WARNING"/>
<property name="absoluteBeamWidth" value="-1"/>
<property name="relativeBeamWidth" value="1E-80"/>
<property name="wordInsertionProbability" value="1E-36"/>
<property name="languageWeight" value="8"/>
<property name="frontend" value="epFrontEnd"/>
<property name="recognizer" value="recognizer"/>
<property name="showCreations" value="false"/>
<!-- ******************************************************** -->
<!-- word recognizer configuration
-->
<!-- ******************************************************** -->
<component name="recognizer" type="edu.cmu.sphinx.recognizer.Recognizer">
<property name="decoder" value="decoder"/>
<propertylist name="monitors">
<item>accuracyTracker </item>
<item>speedTracker </item>
<item>memoryTracker </item>
</propertylist>
</component>
<!-- ******************************************************** -->
<!-- The Decoder
configuration
-->
<!-- ******************************************************** -->
<component name="decoder" type="edu.cmu.sphinx.decoder.Decoder">
<property name="searchManager" value="searchManager"/>
</component>
<component name="searchManager"
type="edu.cmu.sphinx.decoder.search.SimpleBreadthFirstSearchManager">
<property name="logMath" value="logMath"/>
<property name="linguist" value="flatLinguist"/>
<!--<property name="linguist" value="lexTreeLinguist"/>-->
<property name="pruner" value="trivialPruner"/>
<property name="scorer" value="threadedScorer"/>
<property name="activeListFactory" value="activeList"/>
</component>
<component name="activeList"
type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
<property name="logMath" value="logMath"/>
<property name="absoluteBeamWidth" value="${absoluteBeamWidth}"/>
<property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
</component>
<component name="trivialPruner"
type="edu.cmu.sphinx.decoder.pruner.SimplePruner"/>
<component name="threadedScorer"
type="edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer">
<property name="frontend" value="${frontend}"/>
</component>
<!-- ******************************************************** -->
<!-- The linguist configuration
-->
<!-- ******************************************************** -->
<component name="flatLinguist"
type="edu.cmu.sphinx.linguist.flat.FlatLinguist">
<property name="logMath" value="logMath"/>
<property name="grammar" value="jsgfGrammar"/>
<property name="acousticModel" value="wsj"/>
<property name="wordInsertionProbability"
value="${wordInsertionProbability}"/>
<property name="languageWeight" value="${languageWeight}"/>
<property name="unitManager" value="unitManager"/>
</component>
<!-- ******************************************************** -->
<!-- The linguist configuration
-->
<!-- ******************************************************** -->
<component name="lexTreeLinguist"
type="edu.cmu.sphinx.linguist.lextree.LexTreeLinguist">
<property name="logMath" value="logMath"/>
<property name="acousticModel" value="wsj"/>
<property name="languageModel" value="trigramModel"/>
<property name="dictionary" value="dictionary"/>
<property name="addFillerWords" value="false"/>
<property name="fillerInsertionProbability" value="1E-10"/>
<property name="generateUnitStates" value="false"/>
<property name="wantUnigramSmear" value="true"/>
<property name="unigramSmearWeight" value="1"/>
<property name="wordInsertionProbability" value="1E-16"/>
<property name="silenceInsertionProbability" value=".1"/>
<property name="languageWeight" value="8.0"/>
<property name="unitManager" value="unitManager"/>
</component>
<!-- ******************************************************** -->
<!-- The Grammar configuration
-->
<!-- ******************************************************** -->
<component name="jsgfGrammar" type="edu.cmu.sphinx.jsgf.JSGFGrammar">
<property name="dictionary" value="dictionary"/>
<property name="grammarLocation"
value="resource:/com/edu/phoninghome/"/>
<property name="grammarName" value="digits"/>
<property name="logMath" value="logMath"/>
</component>
<!-- ******************************************************** -->
<!-- The Dictionary configuration
-->
<!-- ******************************************************** -->
<component name="dictionary"
type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
<property name="dictionaryPath"
value="resource:/voxforge/etc/cmudict.0.7a"/>
<property name="fillerPath"
value="resource:/voxforge/etc/voxforge_en_sphinx.filler"/>
<property name="addSilEndingPronunciation" value="false"/>
<property name="wordReplacement" value="&lt;sil&gt;"/>
<property name="unitManager" value="unitManager"/>
</component>
<!-- ******************************************************** -->
<!-- The acoustic model configuration
-->
<!-- ******************************************************** -->
<component name="wsj"
type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel">
<property name="loader" value="wsjLoader"/>
<property name="unitManager" value="unitManager"/>
</component>
<component name="wsjLoader"
type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
<property name="logMath" value="logMath"/>
<property name="unitManager" value="unitManager"/>
<property name="location" value="resource:/voxforge"/>
<property name="modelDefinition"
value="model_parameters/voxforge_en_sphinx.cd_cont_3000/mdef"/>
<property name="dataLocation"
value="model_parameters/voxforge_en_sphinx.cd_cont_3000/"/>
</component>
<!-- ******************************************************** -->
<!-- The Language Model configuration
-->
<!-- ******************************************************** -->
<component name="trigramModel"
type="edu.cmu.sphinx.linguist.language.ngram.SimpleNGramModel">
<property name="location"
value="resource:/voxforge/etc/voxforge_en_sphinx.lm"/>
<property name="logMath" value="logMath"/>
<property name="dictionary" value="dictionary"/>
<property name="maxDepth" value="3"/>
<property name="unigramWeight" value=".7"/>
</component>
<!-- ******************************************************** -->
<!-- The unit manager configuration
-->
<!-- ******************************************************** -->
<component name="unitManager"
type="edu.cmu.sphinx.linguist.acoustic.UnitManager"/>
<!-- ******************************************************** -->
<!-- The live frontend configuration
-->
<!-- ******************************************************** -->
<component name="epFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
<propertylist name="pipeline">
<item>audioFileDataSource </item>
<item>dataBlocker </item>
<item>speechClassifier </item>
<item>speechMarker </item>
<item>nonSpeechDataFilter </item>
<item>preemphasizer </item>
<item>windower </item>
<item>fft </item>
<item>melFilterBank </item>
<item>dct </item>
<item>liveCMN </item>
<item>featureExtraction </item>
<item>lda </item>
</propertylist>
</component>
<!-- ******************************************************** -->
<!-- The frontend pipelines
-->
<!-- ******************************************************** -->
<component name="audioFileDataSource"
type="edu.cmu.sphinx.frontend.util.AudioFileDataSource"/>
<component name="dataBlocker"
type="edu.cmu.sphinx.frontend.DataBlocker"/>
<component name="speechClassifier"
type="edu.cmu.sphinx.frontend.endpoint.SpeechClassifier"/>
<component name="nonSpeechDataFilter"
type="edu.cmu.sphinx.frontend.endpoint.NonSpeechDataFilter"/>
<component name="speechMarker"
type="edu.cmu.sphinx.frontend.endpoint.SpeechMarker" />
<component name="preemphasizer"
type="edu.cmu.sphinx.frontend.filter.Preemphasizer"/>
<component name="windower"
type="edu.cmu.sphinx.frontend.window.RaisedCosineWindower">
</component>
<component name="fft"
type="edu.cmu.sphinx.frontend.transform.DiscreteFourierTransform">
</component>
<component name="melFilterBank"
type="edu.cmu.sphinx.frontend.frequencywarp.MelFrequencyFilterBank">
<property name="numberFilters" value="31"/>
<property name="minimumFrequency" value="200"/>
<property name="maximumFrequency" value="3500"/>
</component>
<component name="dct"
type="edu.cmu.sphinx.frontend.transform.DiscreteCosineTransform"/>
<component name="liveCMN"
type="edu.cmu.sphinx.frontend.feature.LiveCMN"/>
<component name="featureExtraction"
type="edu.cmu.sphinx.frontend.feature.DeltasFeatureExtractor" />
<!-- <component name="featureExtraction"
type="edu.cmu.sphinx.frontend.feature.ConcatFeatureExtractor">
<property name="windowSize" value="3"/>
</component>
-->
<component name="lda"
type="edu.cmu.sphinx.frontend.feature.FeatureTransform">
<property name="loader" value="wsjLoader"/>
</component>
<!-- ******************************************************* -->
<!-- monitors
-->
<!-- ******************************************************* -->
<component name="accuracyTracker"
type="edu.cmu.sphinx.instrumentation.BestPathAccuracyTracker">
<property name="recognizer" value="${recognizer}"/>
<property name="showAlignedResults" value="false"/>
<property name="showRawResults" value="false"/>
</component>
<component name="memoryTracker"
type="edu.cmu.sphinx.instrumentation.MemoryTracker">
<property name="recognizer" value="${recognizer}"/>
<property name="showSummary" value="false"/>
<property name="showDetails" value="false"/>
</component>
<component name="speedTracker"
type="edu.cmu.sphinx.instrumentation.SpeedTracker">
<property name="recognizer" value="${recognizer}"/>
<property name="frontend" value="${frontend}"/>
<property name="showSummary" value="true"/>
<property name="showDetails" value="false"/>
</component>
<!-- ******************************************************* -->
<!-- Miscellaneous components
-->
<!-- ******************************************************* -->
<component name="logMath" type="edu.cmu.sphinx.util.LogMath">
<property name="logBase" value="1.0001"/>
<property name="useAddTable" value="true"/>
</component>
</config>
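The relativeBeamWidth and logBase values above interact: the beam is given as a linear probability ratio, while scores are kept in a logarithmic domain. The sketch below shows that arithmetic under the assumption that the linear value is converted with the configured log base and that hypotheses scoring below the best score plus the log beam are pruned. The function names and the example hypotheses are hypothetical, not Sphinx-4 API.

```python
import math

LOG_BASE = 1.0001  # logBase of the logMath component


def linear_to_log(value, base=LOG_BASE):
    """Convert a linear probability to a logarithm in the given base."""
    return math.log(value) / math.log(base)


def prune(scores, relative_beam):
    """Keep hypotheses whose log score lies within the beam of the best one."""
    threshold = max(scores.values()) + linear_to_log(relative_beam)
    return {name: s for name, s in scores.items() if s >= threshold}


# Invented hypothesis scores, expressed as linear probabilities for readability.
scores = {
    "lights on": linear_to_log(1e-20),
    "lights off": linear_to_log(1e-30),
    "lamps on": linear_to_log(1e-120),
}
kept = prune(scores, 1e-80)  # relativeBeamWidth from the configuration
print(sorted(kept))
```

With a beam of 1E-80, a hypothesis survives only while its probability is within 80 orders of magnitude of the best one, so a wider beam trades speed for a lower risk of pruning the correct path.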