Speech Walkthrough: C#
Recognizing Voice Commands with Microsoft Speech API
About this Walkthrough
In the Kinect™ for Windows® Software Development Kit (SDK) Beta, Speech is a C# console application that demonstrates how to use the microphone array in the Kinect for Xbox 360® sensor with the Microsoft Speech API (SAPI) to recognize voice commands. This document is a walkthrough of the beta SDK Speech sample application.
Resources
For a complete list of documentation for the Kinect for Windows SDK Beta, plus related reference material and links to the online forums, see the beta SDK website at:
http://kinectforwindows.org
Contents
Introduction ....................................................................................................................................................................................................... 2
Program Basics ................................................................................................................................................................................................. 2
Create and Configure an Audio Source Object .................................................................................................................................. 4
Create a Speech Recognition Engine ...................................................................................................................................................... 4
Specify the Commands ................................................................................................................................................................................. 5
Recognize Commands................................................................................................................................................................................... 7
License: The Kinect for Windows SDK Beta is licensed for non-commercial use only. By installing, copying, or otherwise
using the beta SDK, you agree to be bound by the terms of its license. Read the license.
Disclaimer: This document is provided “as-is”. Information and views expressed in this document, including URL and other
Internet Web site references, may change without notice. You bear the risk of using it.
This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may
copy and use this document for your internal, reference purposes.
© 2011 Microsoft Corporation. All rights reserved.
Microsoft, DirectX, Kinect, MSDN, and Windows are trademarks of the Microsoft group of companies. All other trademarks
are property of their respective owners.
Speech Walkthrough: C# – 2
Introduction
The audio component of the Kinect™ for Xbox 360® sensor is a four-element microphone array. An
array provides some significant advantages over a single microphone, including more sophisticated
acoustic echo cancellation and noise suppression, and the ability to use beamforming algorithms,
which allow the array to function as a steerable directional microphone.
One key aspect of a natural user interface (NUI) is speech recognition. The Kinect sensor’s microphone
array is an excellent input device for speech recognition-based applications. It provides better sound
quality than a comparable single microphone and is much more convenient to use than a headset. The
Speech sample shows how to use the Kinect sensor’s microphone array with the Microsoft.Speech API
to recognize voice commands.
For an example of how to implement a managed application to capture an audio stream from the
Kinect sensor’s microphone array, see the “RecordAudio Walkthrough” on the website for the Kinect
for Windows® Software Development Kit (SDK) Beta.
For examples of how to implement a C++ application to capture an audio stream from the Kinect
sensor’s microphone array, see “MicArrayEchoCancellation Walkthrough,” “AudioCaptureRaw
Walkthrough,” and “MFAudioFilter Walkthrough ” on the beta SDK website.
Before attempting to compile the Speech application, you must first install the following:

• Microsoft Speech Platform – Software Development Kit (SDK), version 10.2 (x86 edition)

• Microsoft Speech Platform – Server Runtime, version 10.2 (x86 edition)
  The beta SDK runtime is x86-only, so you must download the x86 version of the speech runtime.

• Kinect for Windows Runtime Language Pack, version 0.9
  (acoustic model from the Microsoft Speech Platform for the beta SDK)
Note: The online documentation for the Microsoft.Speech API on the Microsoft® Developer Network
(MSDN®) is limited. You should instead refer to the HTML Help file (CHM) that is included with the
Microsoft Speech Platform SDK. It is located at Program Files\Microsoft Speech Platform SDK\Docs.
Program Basics
Speech is installed with the Kinect for Windows Software Development Kit (SDK) Beta samples in
%KINECTSDK_DIR%\Samples\KinectSDKSamples.zip. Speech is a C# console application that is
implemented in a single file, Program.cs.
Important: Speech targets the x86 platform.
The basic program flow is as follows:
1. Create an object to represent the Kinect sensor’s microphone array.
2. Create a speech recognition object and specify a grammar.
3. Respond to commands.
To use Speech
1. Build the application.
2. Press Ctrl+F5 to run the application.
3. Face the Kinect sensor and say “red,” “green,” or “blue.”
The speech recognition engine prints a notification for each command, including the following:

• Which member of the command set best fits the spoken command.

• A confidence value for that estimate, which is the engine’s estimate of the probability that the word was correctly recognized.

• Whether the command was recognized or rejected as not part of the command set.

An example is shown in the following sample output, where the spoken words were “red,” “blue,” and “yellow.”
Using: Microsoft Server Speech Recognition Language - Kinect (en-US)
Recognizing. Say: 'red', 'green' or 'blue'. Press ENTER to stop
Speech Hypothesized:
red
Speech Recognized:
red
Speech Hypothesized:
blue
Speech Recognized:
blue
Speech Hypothesized:
green
Speech Rejected
Writing file: RetainedAudio_4.wav
Stopping recognizer ...
The remainder of this document walks you through the application.
Note: This document includes code examples, most of which have been edited for brevity and readability. In particular, most routine error-handling code has been removed. For the complete code, see the Speech sample. Hyperlinks in this walkthrough display reference content on the MSDN website.
Create and Configure an Audio Source Object
The KinectAudioSource object represents the Kinect sensor’s microphone array. Behind the scenes, it
uses the MSRKinectAudio Microsoft DirectX® Media object (DMO), as described in detail in
“MicArrayEchoCancellation Walkthrough” on the beta SDK website.
Most of the sample is implemented in Main. The first step is to create and configure
KinectAudioSource, as follows:
static void Main(string[] args)
{
    using (var source = new KinectAudioSource())
    {
        source.FeatureMode = true;
        source.AutomaticGainControl = false;
        source.SystemMode = SystemMode.OptibeamArrayOnly;
        ...
    }
    ...
}
You configure KinectAudioSource by setting various properties, which map directly to the MSRKinectAudio DMO’s property keys. For details, see the reference documentation. The Speech application configures KinectAudioSource as follows:

• Feature mode is enabled.

• Automatic gain control (AGC) is disabled. AGC must be disabled for speech recognition.

• The system mode is set to an adaptive beam without acoustic echo cancellation (AEC). In this mode, the microphone array functions as a single-directional microphone that is pointed within a few degrees of the audio source.
Create a Speech Recognition Engine
Speech creates a speech recognition engine, as follows:
static void Main(string[] args)
{
    using (var source = new KinectAudioSource())
    {
        ...
        RecognizerInfo ri = SpeechRecognitionEngine.InstalledRecognizers()
                                                   .Where(r => r.Id == RecognizerId)
                                                   .FirstOrDefault();

        using (var sre = new SpeechRecognitionEngine(ri.Id))
        {
            ...
        }
    }
    ...
}
SpeechRecognitionEngine.InstalledRecognizers is a static method that returns a list of the speech recognition engines that are installed on the system. Speech uses a Language-Integrated Query (LINQ) expression to find the recognizer whose ID matches the sample’s RecognizerId constant and returns the result as a RecognizerInfo object. Speech then passes RecognizerInfo.Id to the SpeechRecognitionEngine constructor to create the engine.
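Filtering on a hard-coded recognizer ID can be brittle if the ID changes between language-pack releases. As a hypothetical alternative that is not part of the sample, you could select the recognizer by culture and display name instead; the “Kinect” substring below is taken from the recognizer name printed in the sample output earlier in this walkthrough.

```csharp
// Illustrative alternative (not the sample's code): pick the Kinect
// recognizer by culture and display name rather than by its ID. The name
// matched here is "Microsoft Server Speech Recognition Language - Kinect
// (en-US)", as shown in the sample output. Requires "using System.Linq;".
RecognizerInfo ri = SpeechRecognitionEngine.InstalledRecognizers()
    .FirstOrDefault(r => r.Culture.Name == "en-US"
                         && r.Name.Contains("Kinect"));

if (ri == null)
{
    Console.WriteLine("No Kinect speech recognizer is installed.");
    return;
}
```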
Specify the Commands
Speech uses command recognition to recognize three voice commands: “red,” “green,” and “blue.” You
specify these commands by creating and loading a grammar that contains the words to be recognized,
as follows:
static void Main(string[] args)
{
    using (var source = new KinectAudioSource())
    {
        ...
        using (var sre = new SpeechRecognitionEngine(ri.Id))
        {
            var colors = new Choices();
            colors.Add("red");
            colors.Add("green");
            colors.Add("blue");

            var gb = new GrammarBuilder();
            gb.Culture = ri.Culture;
            gb.Append(colors);

            var g = new Grammar(gb);
            sre.LoadGrammar(g);

            sre.SpeechRecognized += SreSpeechRecognized;
            sre.SpeechHypothesized += SreSpeechHypothesized;
            sre.SpeechRecognitionRejected += SreSpeechRecognitionRejected;
            ...
        }
    }
}
The Choices object represents the list of words to be recognized. To add words to the list, call
Choices.Add. After completing the list, create a new GrammarBuilder object—which provides a
simple way to construct a grammar—and specify the culture to match that of the recognizer. Then pass
the Choices object to GrammarBuilder.Append to define the grammar elements. Finally, load the
grammar into the speech engine by calling SpeechRecognitionEngine.LoadGrammar.
Each time you speak a word, the speech recognition engine compares your speech with the templates for the words in the grammar to determine whether it is one of the recognized commands. However, speech recognition is an inherently uncertain process, so each attempt at recognition is accompanied by a confidence value.
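The language pack used by this beta does not have a reliable confidence model, so the sample ignores the confidence value. As an illustrative sketch only, an application using a recognizer with a trustworthy confidence model could filter low-confidence results in its recognized-event handler; the 0.7 threshold below is an arbitrary value chosen for illustration, not a value from the sample.

```csharp
// Hypothetical variant of the recognized-event handler that rejects
// low-confidence results. The 0.7 threshold is illustrative only.
static void SreSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    if (e.Result.Confidence < 0.7f)
    {
        Console.WriteLine("\nLow confidence ({0:F2}), ignoring: {1}",
            e.Result.Confidence, e.Result.Text);
        return;
    }

    Console.WriteLine("\nSpeech Recognized: \t{0}", e.Result.Text);
}
```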
The speech engine raises the following three events:

• The SpeechRecognitionEngine.SpeechHypothesized event occurs for each attempted command. It passes the event handler a SpeechHypothesizedEventArgs object that contains the best-fitting word from the command set and a measure of the estimate’s confidence.
  Note: The Kinect for Windows Runtime Language Pack for this beta SDK does not have a reliable confidence model, so the confidence value is not used.

• The SpeechRecognitionEngine.SpeechRecognized event occurs when an attempted command is recognized as a member of the command set. It passes the event handler a SpeechRecognizedEventArgs object that contains the recognized command.

• The SpeechRecognitionEngine.SpeechRecognitionRejected event occurs when an attempted command is rejected as not being a member of the command set. It passes the event handler a SpeechRecognitionRejectedEventArgs object.
Speech subscribes to all three events and implements the handlers, as follows:
static void SreSpeechHypothesized(object sender,
                                  SpeechHypothesizedEventArgs e)
{
    Console.Write("\rSpeech Hypothesized: \t{0}", e.Result.Text);
}

static void SreSpeechRecognized(object sender,
                                SpeechRecognizedEventArgs e)
{
    Console.WriteLine("\nSpeech Recognized: \t{0}", e.Result.Text);
}

static void SreSpeechRecognitionRejected(object sender,
                                         SpeechRecognitionRejectedEventArgs e)
{
    Console.WriteLine("\nSpeech Rejected");
    if (e.Result != null)
    {
        DumpRecordedAudio(e.Result.Audio);
    }
}
The first two handlers simply print the key data from the event object. The
SreSpeechRecognitionRejected handler calls a private DumpRecordedAudio method to write the
recorded word to a WAV file. For details, see the sample.
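The sample’s DumpRecordedAudio implementation is not reproduced in this walkthrough. A minimal sketch of such a method, assuming the Microsoft.Speech RecognizedAudio.WriteToWaveStream method and “using System.IO;”, might look like the following; the file-name pattern matches the sample output shown earlier, but the counter and structure are illustrative, not the sample’s exact code.

```csharp
static int fileCount = 0;

// Illustrative sketch: write the audio retained for a rejected command
// to a WAV file, producing names such as "RetainedAudio_4.wav".
static void DumpRecordedAudio(RecognizedAudio audio)
{
    if (audio == null)
    {
        return;
    }

    string fileName = string.Format("RetainedAudio_{0}.wav", fileCount++);
    Console.WriteLine("\nWriting file: {0}", fileName);

    using (var stream = new FileStream(fileName, FileMode.Create))
    {
        // WriteToWaveStream writes the captured audio, including the
        // WAV header, to the supplied stream.
        audio.WriteToWaveStream(stream);
    }
}
```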
Recognize Commands
After the speech recognition has been configured, all that Speech needs to do is to start the process.
The speech recognition engine automatically attempts to recognize the words in the grammar and
raises events as appropriate, as shown in the following code example:
static void Main(string[] args)
{
    using (var source = new KinectAudioSource())
    {
        ...
        using (var sre = new SpeechRecognitionEngine(ri.Id))
        {
            ...
            using (Stream s = source.StartCapture(3))
            {
                sre.SetInputToAudioStream(s,
                    new SpeechAudioFormatInfo(EncodingFormat.Pcm,
                                              16000, 16, 1,
                                              32000, 2, null));

                sre.RecognizeAsync(RecognizeMode.Multiple);
                Console.ReadLine();

                Console.WriteLine("Stopping recognizer ...");
                sre.RecognizeAsyncStop();
            }
        }
    }
}
Speech starts capturing audio from the Kinect sensor’s microphone array by calling
KinectAudioSource.StartCapture. Then Speech does the following:
1. Calls SpeechRecognitionEngine.SetInputToAudioStream to specify the audio source and its characteristics.
2. Calls SpeechRecognitionEngine.RecognizeAsync and specifies asynchronous recognition. The engine runs on a background thread until the user stops the process by pressing a key.
3. Calls SpeechRecognitionEngine.RecognizeAsyncStop to stop the recognition process and terminate the engine.
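The SpeechAudioFormatInfo arguments in the call above describe the format of the Kinect audio stream. The annotated version below spells out what each positional argument means, following the constructor’s parameter order:

```csharp
// The same format object, with each constructor argument annotated.
var format = new SpeechAudioFormatInfo(
    EncodingFormat.Pcm, // encoding: uncompressed PCM
    16000,              // samples per second (16 kHz)
    16,                 // bits per sample
    1,                  // channels (mono)
    32000,              // average bytes per second (16000 samples * 2 bytes)
    2,                  // block alignment in bytes (1 channel * 2 bytes)
    null);              // no additional format-specific data
```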
For More Information
For more information about implementing audio and related samples, see the Programming Guide
page on the Kinect for Windows SDK Beta website at:
http://kinectforwindows.org