2.3 Speech Technologies
This chapter discusses two areas of speech technology: text-to-speech (TTS), also called speech synthesis, and speech-to-text (STT), also called speech recognition. So far the keyboard has been the most popular input device for computer applications, but in recent years many applications have been created in which humans interact with the computer using speech.
Speech recognition is the process by which a computer identifies the user's spoken words and recognizes them correctly. In order to understand speech recognition, the reader needs to know some basic concepts:

Utterance: An utterance is the speaking of a word or words that represent a single
meaning to the computer. Utterances can be a single word, a few words, a sentence,
or even multiple sentences.

Speaker Dependence/Independence: A speaker-dependent system is designed for a specific speaker. Its performance is highly accurate for that speaker, but much less accurate for other speakers, because it assumes the speaker will speak in a consistent voice. Speaker-independent systems are designed for a variety of speakers; since different users have different voices and accents, their accuracy is lower than that of speaker-dependent systems.

Vocabularies: Vocabularies are lists of words or utterances that can be recognized by the speech recognition system. Smaller vocabularies are easier for a computer to recognize, while larger vocabularies are harder. A vocabulary entry can be as long as a sentence or two.

Accuracy: The accuracy of a recognizer can be judged by its ability to identify an utterance correctly and to detect when a spoken utterance is not in its vocabulary. The vocabulary size affects the accuracy of the recognizer. Good speech recognition systems can achieve an accuracy of 98% or more; the accuracy that is acceptable really depends on the application.
Training: Some speech recognizers have the ability to adapt to a speaker. A speech recognition system is trained by having the speaker repeat standard or common phrases while the system adjusts its comparison algorithms to match that particular speaker. Training a recognizer usually improves its accuracy.

Keyboard, mouse, timer and socket events are instantaneous in time: there is a defined instant at which each of them occurs. Speech events are not, for two reasons.
1. Firstly, speaking a sentence takes time, so it is not instantaneous. The recognizer starts processing at the start of the speech and produces the recognition result as soon as possible after the end of the speech.
2. Secondly, recognizers cannot always recognize words immediately when they are spoken, nor can they determine immediately when a user has stopped speaking (the same is generally true of human perception of speech). [java speech programmer guide speech recognition]
Speech recognition can be broken down into several different classes depending on the type of utterances a recognizer can handle and its ability to know when the speaker starts and finishes an utterance. These classes are:

Isolated Words: This class requires a single utterance at a time. Such systems often have two states, "Listen" and "Not-Listen", and they require the speaker to pause between utterances (processing is usually done during the pauses). "Isolated Utterances" might be a better name for this class.

Connected Words: Connected-word systems are similar to isolated-word systems, except that they allow separate utterances to be run together with a minimal pause between them.

Continuous Speech: Continuous speech recognizers allow users to speak almost naturally while the computer determines the content. They are among the most difficult recognizers to create because they must use special methods to determine utterance boundaries.

Spontaneous Speech: Spontaneous speech is speech that sounds natural and is not rehearsed. A spontaneous speech recognition system should be able to handle a variety of natural speech features, such as words being run together and fillers like "um" and "ah".

Voice Verification/Identification: Some speech recognition systems can identify specific users by their voices.
2.3.1 Speech Input Event Cycle [java speech programmer guide speech recognition]
The typical recognition state cycle of a Recognizer occurs as speech input arrives, and represents the recognition of a single Result. The Recognizer inherits its basic state system, defined in the javax.speech package, through the Engine interface; the states involved in the recognition cycle are:

LISTENING: The Recognizer starts in the LISTENING state with a certain set of active grammars, and remains in this state until it detects incoming audio that might match an active grammar.

PROCESSING: The Recognizer transitions from the LISTENING state to the PROCESSING state with a RECOGNIZER_PROCESSING event. It remains in the PROCESSING state until it completes recognition of the result; while in the PROCESSING state the Result may be updated with new information.

SUSPENDED: The Recognizer indicates completion of recognition by issuing a RECOGNIZER_SUSPENDED event in order to transition from the PROCESSING state to the SUSPENDED state. It then issues a result finalization event to the ResultListeners to indicate that all information about the result is finalized. The Recognizer remains in the SUSPENDED state until processing of the result finalization event is completed; lastly, it issues a CHANGES_COMMITTED event to return to the LISTENING state.

In the SUSPENDED state incoming speech data is not lost, because the Recognizer buffers the incoming audio. When the Recognizer returns to the LISTENING state, the buffered audio is processed to give the user the perception of real-time processing. The SUSPENDED state therefore serves as a temporary state in which the recognizer configuration can be updated without losing audio data.
In this event cycle, the RECOGNIZER_PROCESSING and RECOGNIZER_SUSPENDED events are triggered by the user's actions: starting and stopping speaking. The CHANGES_COMMITTED event, in contrast, is triggered programmatically some time after the RECOGNIZER_SUSPENDED event. The speech event cycle is shown in Figure 2.
Figure 2 The speech event cycle (LISTENING → PROCESSING → SUSPENDED → LISTENING)
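As an illustration of how an application observes this cycle through JSAPI, the following sketch attaches a listener to a Recognizer and logs each of the state transitions described above. This is only a minimal sketch, assuming a JSAPI 1.0 implementation (such as IBM Speech for Java) is installed; it uses the RecognizerAdapter and RecognizerEvent types from javax.speech.recognition, and the grammar set-up is omitted.

import javax.speech.Central;
import javax.speech.recognition.*;

public class EventCycleDemo {
    public static void main(String[] args) throws Exception {
        // Create and allocate a recognizer for the default engine and locale.
        Recognizer recognizer = Central.createRecognizer(null);
        recognizer.allocate();

        // Log the three transitions of the recognition state cycle.
        recognizer.addEngineListener(new RecognizerAdapter() {
            public void recognizerProcessing(RecognizerEvent e) {
                System.out.println("LISTENING -> PROCESSING (speech detected)");
            }
            public void recognizerSuspended(RecognizerEvent e) {
                System.out.println("PROCESSING -> SUSPENDED (result complete)");
            }
            public void changesCommitted(RecognizerEvent e) {
                System.out.println("SUSPENDED -> LISTENING (changes committed)");
            }
        });

        // Grammars would be loaded and enabled here; once resumed, the
        // recognizer waits in the LISTENING state for matching audio.
        recognizer.commitChanges();
        recognizer.requestFocus();
        recognizer.resume();
    }
}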
2.3.2 IBM ViaVoice
The IBM ViaVoice engine provides developers with the necessary tools to develop applications that incorporate speech. The engine includes several application programming interfaces (APIs) that allow an application to access speech resources and let the developer manage what the user can say in the application.
The IBM ViaVoice engine includes both a recognizer, which supports speech-to-text (STT) conversion such as voice command recognition and dictation, and a synthesizer, which supports text-to-speech (TTS) conversion.
The Java Speech API (JSAPI) is required to provide a cross-platform interface to the speech engines. The JSAPI contains three packages: javax.speech, javax.speech.recognition and javax.speech.synthesis. Moreover, IBM Speech for Java works as a Java programming interface for incorporating IBM ViaVoice technology. The different layers through which an application accesses and uses the IBM ViaVoice engine are shown in Figure 2-7.
Figure 2-7 The relation and layers of an application and IBM ViaVoice Speech Engine
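As an illustration of this layering, an application can ask the JSAPI layer which speech engines are available before creating one. The sketch below is an assumed example: it presumes a JSAPI implementation such as IBM Speech for Java is registered on the system, and it only enumerates the engines and prints their names and locales.

import javax.speech.*;

public class ListEngines {
    public static void main(String[] args) {
        // Enumerate the recognizers and synthesizers registered with JSAPI.
        EngineList recognizers = Central.availableRecognizers(null);
        EngineList synthesizers = Central.availableSynthesizers(null);

        for (int i = 0; i < recognizers.size(); i++) {
            EngineModeDesc d = (EngineModeDesc) recognizers.elementAt(i);
            System.out.println("Recognizer: " + d.getEngineName()
                    + " (" + d.getLocale() + ")");
        }
        for (int i = 0; i < synthesizers.size(); i++) {
            EngineModeDesc d = (EngineModeDesc) synthesizers.elementAt(i);
            System.out.println("Synthesizer: " + d.getEngineName()
                    + " (" + d.getLocale() + ")");
        }
    }
}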
Synthesizer [tts-us-en readme description]
IBM Text-to-Speech provides the speech synthesis engine and the components necessary for applications to produce speech. Recorded human speech units are combined according to linguistic rules formulated from the analyzed text in order to produce natural, human-sounding speech.
The speech synthesis engine and data support two types of speech presentation: a concatenative voice dataset and a computer-synthesized voice known as formant synthesis. A concatenative voice dataset is a speech presentation spoken by a professional speaker, speaking a particular language and accent, recorded at a particular sampling rate. If the user changes the language, a new voice dataset is loaded into memory if it is not already cached there. For example, if the user is using English at 8 kHz with voice 1, and English voice 1 at 8 kHz has been installed, then the system will automatically perform concatenative synthesis; otherwise it will perform formant synthesis.
Since speech synthesizers do not understand what they say, they do not always use the right style or phrasing and do not provide the same naturalness as people. The Java Speech API Markup Language (JSML) allows applications to annotate the text to be spoken with additional information in order to improve the quality and naturalness of the synthesized speech. [jsml]
JSML is an XML application that defines a specific set of elements for marking up spoken text. JSML also defines the meaning of those elements, so that there is a common understanding between synthesizers and document producers of how to speak the marked-up text. The JSML element set includes several types of elements: structural elements that mark paragraphs and sentences; control elements that govern the pronunciation of words and phrases, the emphasis of words (whether they are stressed or accented), the placement of boundaries and pauses, and the speaking rate; and marker elements embedded in the text to enable synthesizer-specific controls. [jsml]
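To make the markup concrete, the sketch below passes a small JSML-annotated string to a JSAPI synthesizer. It is only an assumed example: the emp, break and pros elements follow the commonly documented JSML element set, but the exact element names and attributes accepted depend on the JSML version supported by the installed synthesizer.

import javax.speech.Central;
import javax.speech.synthesis.*;
import java.util.Locale;

public class JsmlDemo {
    public static void main(String[] args) throws Exception {
        // Create and start a synthesizer for U.S. English.
        Synthesizer synth = Central.createSynthesizer(
                new SynthesizerModeDesc(Locale.US));
        synth.allocate();
        synth.resume();

        // JSML-marked text: emphasis, a pause, and a slower speaking rate.
        String jsml =
            "<jsml>" +
            " Your meeting starts at <emp>ten</emp> o'clock." +
            " <break/>" +
            " <pros rate=\"-20%\">Please do not be late.</pros>" +
            "</jsml>";

        // speak() interprets the markup; speakPlainText() would read it literally.
        synth.speak(jsml, null);
        synth.waitEngineState(Synthesizer.QUEUE_EMPTY);
        synth.deallocate();
    }
}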
Recognizer [developer part for IBM]
The recognizer is the heart of a speech recognition system. Speech recognition systems provide computers with the ability to listen to user speech, determine what is said, and translate it into text that the application can understand; this text is then used by the application.
The speech recognizer engine works with several resources, such as the user's language of origin and the domain. The user's language of origin is the language used by the speakers. IBM ViaVoice on Windows supports many languages, such as U.S. English, U.K. English, German, French, Italian, Arabic, Japanese and Chinese. The domain, on the other hand, is a set of vocabulary and word-usage models designed to support an application. Both the vocabulary and the word-usage model are used together by the speech engine to decode speech for the application.
In general, each application consists of different parts, and for each of those parts the user will say something different, so an important aspect of the application is knowing what the user may want to say. The words or phrases that the user can say are called the vocabulary. The accuracy and speed of recognizing this vocabulary depend on its size.
The speech recognizer constrains that vocabulary by using a grammar in order to achieve reasonable recognition accuracy and response time. A grammar is a collection of words and phrases bound together by rules that define the set of all the speech that represents a complete command the speech engine can recognize at a specific time. A grammar consists of a header and a body. The header declares the grammar name and lists the imported rules and grammars. The body defines the grammar's rules as combinations of spoken text and references to other rules, for example:
grammar javax.speech.demo;                            // the header
public <sentence> = good morning | good bye | Yes;    // the body
A grammar can be created using a plain text editor and specified using a specialized speech recognition control language, or SRCL (designed as a joint effort between SRAPI, the speech recognition API group, and the ECTF, the Enterprise Computer Telephony Forum). When the grammar file is compiled, the grammar compiler converts it into a binary file that can be used by the speech engine, which then determines the phrases and words available for the user to say.
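At the JSAPI level, the corresponding steps are to load a rule grammar into the recognizer, enable it, commit the changes, and listen for results. The following is a minimal sketch of that workflow; the file name demo.gram (holding the grammar shown above) and the listener behavior are illustrative assumptions, and the calls follow the javax.speech.recognition interfaces rather than the ViaVoice-specific SRCL tool chain.

import javax.speech.Central;
import javax.speech.recognition.*;
import java.io.FileReader;

public class GrammarDemo extends ResultAdapter {
    // Print the best tokens of each accepted result.
    public void resultAccepted(ResultEvent e) {
        Result result = (Result) e.getSource();
        ResultToken[] tokens = result.getBestTokens();
        StringBuffer text = new StringBuffer();
        for (int i = 0; i < tokens.length; i++) {
            text.append(tokens[i].getSpokenText()).append(' ');
        }
        System.out.println("Recognized: " + text);
    }

    public static void main(String[] args) throws Exception {
        Recognizer recognizer = Central.createRecognizer(null);
        recognizer.allocate();

        // Load the JSGF grammar shown above from a file and enable it.
        RuleGrammar grammar = recognizer.loadJSGF(new FileReader("demo.gram"));
        grammar.setEnabled(true);

        recognizer.addResultListener(new GrammarDemo());
        recognizer.commitChanges();   // commit the enabled grammar
        recognizer.requestFocus();
        recognizer.resume();          // start listening
    }
}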
When developers design their own grammar files, they have to take the following aspects into account:
1. Keep the number of phrases small in order to obtain fast and more accurate speech recognition; limiting the grammar size also makes the word-matching process easier.
2. Developing a long and narrow grammar is better than developing a short and wide one.
3. Avoid using words with very similar pronunciations (for example, "cat" and "hat") in the same portion of the grammar.
4. Allowing users to say things in more than one way enhances usability.
5. Avoid supporting so many ways of saying a command that the user cannot remember what to say.
Since the user might speak some undefined words, or the background environment might be noisy, the recognition engine may accept or reject the spoken words. A spoken vocabulary item can be rejected when a time-out is exceeded, when the user stops before completing the phrase, or when the score of the word is low relative to the threshold setting (these settings are available through the ViaVoice properties in the Windows control panel).
Lastly, users do not always speak in exactly continuous speech: they might pause or interject extraneous speech into their utterances. Such occurrences are called embedded silence, and developers can request the engine to handle embedded silence for their grammars.
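In JSAPI terms, an application can observe acceptance and rejection through the ResultListener interface. The sketch below merely distinguishes the two outcomes; it is an assumed example and does not expose the ViaVoice-specific time-out or threshold settings mentioned above. It would be registered with recognizer.addResultListener(...) in the same way as the listener in the earlier grammar sketch.

import javax.speech.recognition.*;

// A listener that reports whether the engine accepted or rejected an utterance.
public class AcceptRejectListener extends ResultAdapter {
    public void resultAccepted(ResultEvent e) {
        Result result = (Result) e.getSource();
        System.out.println("Accepted: "
                + result.numTokens() + " token(s) recognized.");
    }

    public void resultRejected(ResultEvent e) {
        // The audio did not match an active grammar well enough
        // (noise, out-of-vocabulary words, or an incomplete phrase).
        System.out.println("Utterance rejected.");
    }
}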
2.3.3 Java Speech API [java speech api speech recognition 4.6 and introduction]
The Java Speech API (JSAPI) was developed by Sun Microsystems in cooperation with companies dealing with speech technologies, such as IBM and Lernout & Hauspie, in order to provide a cross-platform interface to speech engines. JSAPI is designed to give access to speech recognizers and synthesizers, to keep simple speech applications simple, and to make advanced speech applications possible for non-specialist developers.
The javax.speech.recognition package defines the Recognizer interface to support speech recognition, while javax.speech.synthesis defines the Synthesizer interface to support speech synthesis. Much of their functionality is inherited from the Engine interface in the javax.speech package.
The Java Speech API supports two types of grammars: rule grammars and dictation grammars. These grammars differ in how patterns of words are defined and in their programmatic use. A dictation grammar is built into a recognizer. It defines a huge set of words (possibly tens of thousands) that might be spoken in an unrestricted way; dictation grammars are thus more flexible than rule grammars. A rule grammar is provided to a recognizer by an application in order to define a set of rules that indicates what a user may say.
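Because a dictation grammar is built into the recognizer, an application does not define it but simply requests and enables it. The sketch below assumes the installed engine (for example, ViaVoice) provides dictation support; the null argument asks for the recognizer's default dictation grammar.

import javax.speech.Central;
import javax.speech.recognition.*;

public class DictationDemo {
    public static void main(String[] args) throws Exception {
        Recognizer recognizer = Central.createRecognizer(null);
        recognizer.allocate();

        // The dictation grammar is built into the recognizer; the application
        // only has to request and enable it.
        DictationGrammar dictation = recognizer.getDictationGrammar(null);
        dictation.setEnabled(true);

        recognizer.commitChanges();
        recognizer.requestFocus();
        recognizer.resume();   // dictated words now arrive as Results
    }
}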
Rules can be defined by tokens, by references to other rules, and by logical combinations of both. Rule grammars can capture a wide range of spoken input from users by combining simple grammars and rules.
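These logical combinations map directly onto the rule classes of javax.speech.recognition. The sketch below builds, programmatically, a rule equivalent to the earlier javax.speech.demo example; the grammar name demo.commands and the rule name sentence are illustrative assumptions.

import javax.speech.Central;
import javax.speech.recognition.*;

public class RuleCompositionDemo {
    public static void main(String[] args) throws Exception {
        Recognizer recognizer = Central.createRecognizer(null);
        recognizer.allocate();

        // Create an empty rule grammar and define one public rule in it.
        RuleGrammar grammar = recognizer.newRuleGrammar("demo.commands");

        // <sentence> = good morning | good bye | yes;
        Rule sentence = new RuleAlternatives(new Rule[] {
            new RuleSequence(new Rule[] {
                new RuleToken("good"), new RuleToken("morning") }),
            new RuleSequence(new Rule[] {
                new RuleToken("good"), new RuleToken("bye") }),
            new RuleToken("yes")
        });
        grammar.setRule("sentence", sentence, true);   // true = public rule
        grammar.setEnabled(true);

        recognizer.commitChanges();
        recognizer.requestFocus();
        recognizer.resume();
    }
}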