2.3 Speech Technologies

This chapter discusses two areas of speech technology: text-to-speech (TTS), also called speech synthesis, and speech-to-text (STT), also called speech recognition. The keyboard is still the most popular input device for computer applications, but in recent years many applications have been created in which humans interact with the computer by speech. Speech recognition is the process by which a computer identifies the user's spoken words and recognizes them correctly. In order to understand speech recognition, the reader should be familiar with some basic concepts:

Utterance: An utterance is the speaking of a word or words that represent a single meaning to the computer. An utterance can be a single word, a few words, a sentence, or even multiple sentences.

Speaker dependence/independence: A speaker-dependent system is designed for a specific speaker. Its performance is highly accurate for that speaker but much less accurate for other speakers, and it assumes the speaker will speak in a consistent voice. Speaker-independent systems are designed for a variety of speakers; since different users have different voices and accents, their accuracy is lower than that of speaker-dependent systems.

Vocabularies: Vocabularies are lists of words or utterances that can be recognized by the speech recognition system. Smaller vocabularies are easier for a computer to recognize, while larger vocabularies are harder. A vocabulary entry can be as long as a sentence or two.

Accuracy: The accuracy of a recognizer is measured by its ability to identify an utterance and to detect when a spoken utterance is not in its vocabulary. The vocabulary size affects the accuracy of the recognizer. Good speech recognition systems can reach an accuracy of 98% or more; the acceptable accuracy really depends on the application.

Training: Some speech recognizers have the ability to adapt to a speaker. A speech recognition system is trained by having the speaker repeat standard or common phrases while the system adjusts its comparison algorithms to match that particular speaker. Training a recognizer usually improves its accuracy.

The events of a keyboard, a mouse, a timer or a socket are instantaneous in time: there is a defined instant at which they occur. Speech events are not, for two reasons:

1. Speaking a sentence takes time, so it is not instantaneous. The recognizer starts processing at the start of the speech and produces the recognition result as soon as possible after the end of the speech.
2. Recognizers cannot always recognize words immediately when they are spoken, and they cannot determine immediately when a user has stopped speaking (the same principles are generally true of human perception of speech). [java speech programmer guide speech recognition]
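This timing becomes visible to a Java application through the result events of the Java Speech API, which is described in more detail in section 2.3.3. The fragment below is only a minimal sketch of such a listener: the class name is invented for this example and the engine set-up is omitted.

    import javax.speech.recognition.ResultAdapter;
    import javax.speech.recognition.ResultEvent;

    // Minimal sketch: a result listener that makes the timing of speech events
    // visible. A result is created while the user is still speaking, updated as
    // recognition proceeds, and only accepted some time after the end of speech.
    public class TimingListener extends ResultAdapter {

        public void resultCreated(ResultEvent e) {
            // The recognizer has detected speech that may match an active grammar.
            System.out.println(System.currentTimeMillis() + "  speech detected");
        }

        public void resultUpdated(ResultEvent e) {
            // Recognition is still in progress; the user may still be speaking.
            System.out.println(System.currentTimeMillis() + "  result updated");
        }

        public void resultAccepted(ResultEvent e) {
            // The result is finalized only after the end of speech has been detected.
            System.out.println(System.currentTimeMillis() + "  result accepted");
        }
    }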
Speech recognition systems can be broken down into several classes depending on the type of utterances they can recognize and on their ability to determine when the speaker starts and finishes an utterance. Those classes are:

Isolated words: This class accepts a single utterance at a time. These systems often have two states, "listening" and "not listening", and they require the speaker to pause between utterances (processing usually takes place during the pauses). "Isolated utterances" might be a better name for this class.

Connected words: Connected-word systems are similar to isolated-word systems except that they allow separate utterances to be run together with a minimal pause between them.

Continuous speech: Continuous-speech recognizers allow users to speak almost naturally while the computer determines the content. They are among the most difficult recognizers to create because they must use special methods to determine utterance boundaries.

Spontaneous speech: Spontaneous speech is speech that sounds natural and is not rehearsed. A spontaneous speech recognition system should be able to handle a variety of natural speech features, such as words being run together and fillers such as "um" and "ah".

Voice verification/identification: Some speech recognition systems can identify specific users by their voices.

2.3.1 Speech Input Event Cycle [java speech programmer guide speech recognition]

The typical recognition state cycle of a Recognizer takes place as speech input arrives and represents the recognition of a single Result. The Recognizer inherits the basic state system defined in the javax.speech package through the Engine interface. The basic states of this cycle are:

LISTENING: The Recognizer starts in the LISTENING state with a certain set of active grammars. It remains in this state until it detects incoming audio that might match an active grammar, at which point it transitions to the PROCESSING state with a RECOGNIZER_PROCESSING event.

PROCESSING: The Recognizer remains in the PROCESSING state until it completes recognition of the result. While in the PROCESSING state the Result may be updated with new information.

SUSPENDED: The Recognizer indicates completion of recognition by issuing a RECOGNIZER_SUSPENDED event and transitions from the PROCESSING state to the SUSPENDED state. It then issues a result finalization event to its ResultListeners to indicate that all information about the result is finalized, and it remains in the SUSPENDED state until processing of the result finalization event is completed. Lastly, the Recognizer issues a CHANGES_COMMITTED event to return to the LISTENING state. In the SUSPENDED state incoming speech data is not lost, because the Recognizer buffers the incoming audio; when it returns to the LISTENING state the buffered audio is processed, giving the user the perception of real-time processing. The SUSPENDED state therefore serves as a temporary state in which the recognizer configuration can be updated without losing audio data.

In this event cycle the RECOGNIZER_PROCESSING and RECOGNIZER_SUSPENDED events are triggered by the user's actions, starting and stopping speaking, while the CHANGES_COMMITTED event is triggered programmatically some time after the RECOGNIZER_SUSPENDED event. The speech event cycle is shown in Figure 2.

Figure 2: Speech event cycle (LISTENING → PROCESSING → SUSPENDED → LISTENING)
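As a rough illustration of this cycle, the following sketch shows how an application could observe the three states through a RecognizerAdapter from the javax.speech.recognition package; it only logs the transitions and assumes that a Recognizer has already been created and allocated.

    import javax.speech.recognition.RecognizerAdapter;
    import javax.speech.recognition.RecognizerEvent;

    // Sketch of a listener that traces the LISTENING -> PROCESSING -> SUSPENDED
    // cycle shown in Figure 2.
    public class StateTracer extends RecognizerAdapter {

        // LISTENING -> PROCESSING: incoming audio may match an active grammar.
        public void recognizerProcessing(RecognizerEvent e) {
            System.out.println("PROCESSING: incoming speech is being recognized");
        }

        // PROCESSING -> SUSPENDED: recognition of the result is complete and
        // further audio is buffered while the result is finalized.
        public void recognizerSuspended(RecognizerEvent e) {
            System.out.println("SUSPENDED: result finalized, audio is buffered");
        }

        // SUSPENDED -> LISTENING: grammar changes have been committed and the
        // buffered audio is processed again.
        public void changesCommitted(RecognizerEvent e) {
            System.out.println("LISTENING: ready for the next utterance");
        }
    }

The tracer would be attached to an existing recognizer with recognizer.addEngineListener(new StateTracer()).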
2.3.2 IBM ViaVoice

The IBM ViaVoice engine provides developers with the tools necessary to develop applications that incorporate speech. It includes several application programming interfaces (APIs) that allow an application to access speech resources and allow the developer to manage what the user can say in the application. The IBM ViaVoice engine includes both a recognizer, which supports speech-to-text (STT) conversion, voice command recognition and dictation, and a synthesizer, which supports text-to-speech (TTS) conversion. The Java Speech API (JSAPI) provides a cross-platform interface to the speech engines and contains three packages: javax.speech, javax.speech.recognition and javax.speech.synthesis. Moreover, IBM Speech for Java works as a Java programming interface for incorporating IBM ViaVoice technology. The different layers through which an application accesses and uses the IBM ViaVoice engine are shown in Figure 2-7.

Figure 2-7: The relation and layers of an application and the IBM ViaVoice speech engine

Synthesizer [tts-us-en readme description]

IBM Text-to-Speech provides the speech synthesis engine and the components necessary for applications to produce speech. Recorded human speech units are combined according to linguistic rules, formulated from the analyzed text, in order to produce natural, human-sounding speech. The speech synthesis engine and its data support two types of speech presentation: a concatenative voice dataset and a computer-synthesized voice known as formant synthesis. A concatenative voice dataset is a speech presentation recorded by a professional speaker of a particular language and accent at a particular sampling rate. When the user changes the language, a new voice dataset is loaded into memory if it is not already cached there. For example, if the user selects U.S. English voice 1 at 8 kHz and that voice dataset has been installed, the system will automatically perform concatenative synthesis; otherwise, the system will perform formant synthesis.

Since speech synthesizers do not understand what they say, they do not always use the right style or phrasing and do not achieve the same naturalness as people. The Java Speech API Markup Language (JSML) allows applications to annotate the spoken text with additional information in order to improve the quality and naturalness of the synthesized speech. [jsml] JSML is an XML application that defines a specific set of elements for marking up text to be spoken; it also defines the interpretation of those elements, so that synthesizers and document producers share a common understanding of how the marked-up text should be spoken. The JSML element set includes several types of elements: structural elements that mark paragraphs and sentences; production elements that control the pronunciation of words and phrases, the emphasis (stressing or accenting) of words, the placement of boundaries and pauses, and the speaking rate; and marker elements embedded in the text to enable synthesizer-specific controls. [jsml]
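To give a flavour of how JSML can be used from Java, the fragment below is a small sketch that speaks a marked-up prompt through the javax.speech.synthesis package. The class name is invented for this example, and the JSML elements used here (<jsml>, <emp>, <break>) follow the element set described above; the exact element names and attributes may differ between JSML versions and engines.

    import java.util.Locale;
    import javax.speech.Central;
    import javax.speech.synthesis.Synthesizer;
    import javax.speech.synthesis.SynthesizerModeDesc;

    // Minimal sketch: speak a short JSML-annotated prompt with a default
    // U.S. English synthesizer.
    public class JsmlDemo {
        public static void main(String[] args) throws Exception {
            Synthesizer synth =
                Central.createSynthesizer(new SynthesizerModeDesc(Locale.US));
            synth.allocate();   // acquire the engine resources
            synth.resume();     // leave the paused state so output is audible

            // Emphasis and a pause are marked up in the text itself.
            String jsml = "<jsml>Your meeting starts <emp>in five minutes</emp>."
                        + "<break msecs=\"500\"/>Please do not be late.</jsml>";

            synth.speak(jsml, null);                        // queue the JSML text
            synth.waitEngineState(Synthesizer.QUEUE_EMPTY); // wait until spoken
            synth.deallocate();
        }
    }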
Recognizer [developer part for IBM]

The speech recognition engine is the heart of a speech recognition system. Speech recognition provides computers with the ability to listen to the user's speech, determine what is said and translate it into text that the application can understand; this text is then used by the application. The speech recognition engine works with several resources, such as the user's language of origin and the domain. The user's language of origin is the language used by the speakers; IBM ViaVoice on Windows supports many languages, such as US English, UK English, German, French, Italian, Arabic, Japanese and Chinese. The domain, on the other hand, is a vocabulary and word-usage model designed to support an application; the vocabulary and the word-usage model are used together by the speech engine to decode speech for the application.

In general, each application consists of different parts, and for each of those parts the user will say something different, so an important aspect for the application is to know what the user may want to say. The words or phrases that the user can say are called the vocabulary, and the accuracy and speed of recognizing that vocabulary depend on its size. The speech recognizer therefore constrains the vocabulary with a grammar in order to achieve reasonable recognition accuracy and response time. A grammar is a collection of words and phrases, bound together by rules, that defines the set of all utterances representing a complete command which the speech engine can recognize at a given time. A grammar consists of a header and a body: the header declares the grammar name and lists the imported rules and grammars, while the body defines the grammar's rules as combinations of spoken text and references to other rules. For example:

    grammar javax.speech.demo;                            // the header
    public <sentence> = good morning | good bye | yes;    // the body

A grammar can be created with a plain text editor and specified using the Speech Recognition Control Language (SRCL), designed as a joint effort between SRAPI (the Speech Recognition API committee) and the ECTF (Enterprise Computer Telephony Forum). When the grammar file is compiled, the grammar compiler converts it into a binary file that the speech engine uses to determine which phrases and words the user can say. When developers design their own grammar files, they have to take the following aspects into account:

1. Keep the number of phrases small in order to obtain fast and accurate recognition; limiting the size also makes the word-matching process easier.
2. A long and narrow grammar is better than a short and wide one.
3. Avoid using words with very similar pronunciations (such as "cat" and "hat") in the same portion of the grammar.
4. Allowing users to say a command in more than one way enhances usability.
5. At the same time, avoid supporting too many ways of saying the same command, or the user will not remember what to say.

Since the user might speak undefined words, or the background environment might be noisy, the recognition engine may accept or reject the spoken words. A spoken phrase can be rejected when the timeout is exceeded, when the user stops before completing the phrase, or when the score of the word is low relative to the threshold setting (these settings are available through the ViaVoice properties in the Windows control panel). Lastly, users do not always speak in exactly continuous speech; they might pause or interject extraneous speech into their phrases. Such occurrences are called embedded silence, and developers can request the engine to handle embedded silence for their grammar.
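As a small illustration of how such a grammar could be used from an application, the sketch below loads the javax.speech.demo grammar shown above through the Java Speech API (described in the next section), enables it, and prints each accepted utterance. The class name and the file name demo.gram are assumptions made for this example.

    import java.io.FileReader;
    import javax.speech.Central;
    import javax.speech.recognition.Recognizer;
    import javax.speech.recognition.Result;
    import javax.speech.recognition.ResultAdapter;
    import javax.speech.recognition.ResultEvent;
    import javax.speech.recognition.ResultToken;
    import javax.speech.recognition.RuleGrammar;

    // Sketch: load the javax.speech.demo grammar (assumed to be stored in the
    // file demo.gram), activate it and print every accepted utterance.
    public class GreetingRecognizer {
        public static void main(String[] args) throws Exception {
            Recognizer recognizer = Central.createRecognizer(null); // default engine
            recognizer.allocate();

            // Load and enable the rule grammar shown above.
            RuleGrammar grammar = recognizer.loadJSGF(new FileReader("demo.gram"));
            grammar.setEnabled(true);

            recognizer.addResultListener(new ResultAdapter() {
                public void resultAccepted(ResultEvent e) {
                    Result result = (Result) e.getSource();
                    StringBuffer text = new StringBuffer();
                    for (ResultToken token : result.getBestTokens()) {
                        text.append(token.getSpokenText()).append(' ');
                    }
                    System.out.println("You said: " + text.toString().trim());
                }
            });

            recognizer.commitChanges();   // commit the enabled grammar
            recognizer.requestFocus();    // request the speech focus
            recognizer.resume();          // start listening
        }
    }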
2.3.3 Java Speech API [java speech api speech recognition 4.6 and introduction]

The Java Speech API (JSAPI) was developed by Sun Microsystems in cooperation with companies working on speech technologies, such as IBM and Lernout & Hauspie, in order to provide a cross-platform interface to speech engines. JSAPI is designed to give access to speech recognizers and synthesizers, to keep simple speech applications simple, and to make advanced speech applications possible for non-specialist developers. The javax.speech.recognition package defines the Recognizer interface to support speech recognition, while javax.speech.synthesis defines the Synthesizer interface to support speech synthesis; much of their functionality is inherited from the Engine interface in the javax.speech package.

The Java Speech API supports two types of grammars: rule grammars and dictation grammars. They differ in how patterns of words are defined and in their programmatic use. A dictation grammar is built into a recognizer; it defines a huge set of words, possibly tens of thousands, that may be spoken in an unrestricted way, and it is more flexible than a rule grammar. A rule grammar is provided to a recognizer by an application and defines a set of rules that indicate what a user may say. Rules can be defined by tokens, by references to other rules, and by logical combinations of both. Rule grammars can capture a wide range of spoken input from users by combining simple grammars and rules.
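To make the notions of tokens, rule references and logical combinations concrete, the following is a small illustrative rule grammar written in the same style as the example in the previous section; the grammar name and rule names are invented for this sketch.

    grammar thesis.demo.commands;                    // header: the grammar name

    <polite> = please | kindly;                      // alternatives between tokens
    <object> = the (file | document | message);      // grouping of tokens
    <action> = open | close | delete;

    // The public rule combines tokens with references to the rules above;
    // the square brackets mark the politeness word as optional.
    public <command> = [<polite>] <action> <object>;

A single public rule such as <command> covers 27 complete commands (three politeness options, including omitting it, times three actions times three objects), while the grammar itself stays small, in line with the design guidelines given above.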