X Speech Processing Richard V. Cox AT&T Labs — Research Lawrence R. Rabiner AT&T Labs — Research 44 Speech Production Models and Their Digital Implementations and Juergen Schroeter M. Mohan Sondhi Introduction • Geometry of the Vocal and Nasal Tracts • Acoustical Properties of the Vocal and Nasal Tracts • Sources of Excitation • Digital Implementations 45 Speech Coding Richard V. Cox Introduction • Useful Models for Speech and Hearing • Types of Speech Coders • Current Standards 46 Text-to-Speech Synthesis Richard Sproat and Joseph Olive Introduction • Text Analysis and Linguistic Analysis • Speech Synthesis • The Future of TTS 47 Speech Recognition by Machine Lawrence R. Rabiner and B. H. Juang Introduction • Characterization of Speech Recognition Systems • Sources of Variability of Speech • Approaches to ASR by Machine • Speech Recognition by Pattern Matching • Connected Word Recognition • Continuous Speech Recognition • Speech Recognition System Issues • Practical Issues in Speech Recognition • ASR Applications 48 Speaker Verification Sadaoki Furui and Aaron E. Rosenberg Introduction • Personal Identity Characteristics • Vocal Personal Identity Characteristics • Basic Elements of a Speaker Recognition System • Extracting Speaker Information from the Speech Signal • Feature Similarity Measurements • Units of Speech for Representing Speakers • Input Modes • Representations • Optimizing Criteria for Model Construction • Model Training and Updating • Signal Feature and Score Normalization Techniques • Decision Process • Outstanding Issues 49 DSP Implementations of Speech Processing Kurt Baudendistel Software Development Targets • Software Development Paradigms • Assembly Language Basics • Arithmetic • Algorithmic Constructs 50 Software Tools for Speech Research and Development John Shore Introduction • Historical Highlights • The User’s Environment (OS-Based vs. Workspace-Based) • Compute-Oriented vs. Display-Oriented • Compiled vs. Interpreted • Specifying Operations Among Signals • Extensibility (Closed vs. Open Systems) • Consistency Maintenance • Other Characteristics of Common Approaches • File Formats (Data Import/Export) • Speech Databases • Summary of Characteristics and Uses • Sources for Finding Out What is Currently Available • Future Trends 1999 by CRC Press LLC c W ITH THE ADVENT OF CHEAP, HIGH SPEED PROCESSORS, and with the everdecreasing cost of memory, the cost of speech processing has been driven down to the point where it can be (and has been) embedded in almost any system, from a low cost consumer product (e.g., solid-state digital answering machines, voice controlled telephones, etc.), to a desktop application (e.g., voice dictation of a first draft quality manuscript), to an application embedded in a voice or data network (e.g., voice dialing, packet telephony, voice browser for the Internet, etc.). It is the purpose of this section of the Handbook to provide discussions of several of the key technologies in speech processing and to illustrate how the technologies are implemented using special-purpose DSP processor chips or via standard software packages running on more conventional processors. The broad area of speech processing can be broken down into several individual areas according to both applications and technology. These include: 1. Speech Production Models and their Digital Implementations (see Chapter 44 by Sondhi and Schroeter). In order to understand how the characteristics of a speech signal can be exploited in the different application areas, it is necessary to understand the properties and constraints of the human vocal apparatus (to understand how speech is generated by humans). It is also necessary to understand the way in which models can be built that simulate speech production as well as the ways in which they can be implemented as digital systems, since such models form the basis for almost all practical speech processing systems. 2. Speech Coding (see Chapter 45 by Cox). Speech coding is the process of compressing the information in a speech signal so as to either transit it or store it economically over a channel whose bandwidth is significantly smaller than that of the uncompressed signal. Speech coding is used as the basis for most modern voice messaging and voice mail systems, for voice response systems, for digital cellular and for satellite transmission of speech, for packet telephony, for ISDN teleconferencing, and for digital answering machines and digital voice encryption machines. 3. Text-to-Speech Synthesis (see Chapter 46 by Sproat and Olive). Speech synthesis is the process of creating a synthetic replica of a speech signal so as to transmit a message from a machine to a person, with the purpose of conveying the information in the message. Speech synthesis is often called “text-to-speech” or TTS, to convey the idea that, in general, the input to the system is ordinary ASCII text, and the output of the system is ordinary speech. The goal of most speech synthesis systems is to provide a broad range of capability for having a machine speak information (stored in the machine) to a user. Key aspects of synthesis systems are the intelligibility and the naturalness of the resulting speech. The major applications of speech synthesis include acting as a voice server for text-based information services (e.g., stock prices, sports scores, flight information); providing a means for reading e-mail, or the text portions of FAX messages over ordinary phone lines; providing a means for previewing text stored in documents (e.g., document drafts, Internet files); and finally as a voice readout for handheld devices, (e.g., phrase book translators, dictionaries, etc.) 4. Speech Recognition by Machine (see Chapter 47 by Rabiner and Juang). Speech recognition is the process of extracting the message information in a speech signal so as to control the action of a machine in response to spoken commands. In a sense, speech recognition is the complementary process to speech synthesis, and together they constitute the building blocks of a voice dialogue system with a machine. There are many factors which influence the type of speech recognition system that is used for different applications, including the mode of speaking to the machine (e.g., single commands, digit sequences, fluent sentences), the size and complexity of the vocabulary which the machine understands, the task which the machine 1999 by CRC Press LLC c is asked to accomplish, the environment in which the recognition system must run, and finally the cost of the system. Although there is a wide range of applications of speech recognition systems, the most generic systems are simple “command-and-control” systems (with menulike interfaces), and the most advanced systems support full voice dialogues for dictation, forms entry, catalog ordering, reservation services, etc. 5. Speaker Verification (see Chapter 48 by Furui and Rosenberg). Speaker verification is the process of verifying the claimed identity of a speaker for the purpose of restricting access to information (e.g., personal or private records), networks (computer, PBX), or physical premises. The basic problem of speaker verification is to decide whether or not an unknown speech sample was spoken by the individual whose identity was claimed. A key aspect of any speaker verification system is to accept the true speaker as often as possible while rejecting the impostor as often as possible. Since these are inherently conflicting goals, all practical systems arrive at some compromise between levels of these two types of system errors. The major area of application for speaker verification is in access control to information, credit, banking, machines, computer networks, private branch exchanges (PBX’s), and even premises. The concept of a “voice lock” that prevents access until the appropriate speech by the authorized individual(s) (e.g., “Open Sesame”) is “heard” by the system is made a reality using speaker verification technology. 6. DSP Implementations of Speech Processing (see Chapter 49 by Baudendistel). Until a few years ago, almost all speech processing systems were implemented on low-cost DSP fixed-point processors because of their high efficiency in realizing the computational aspects of the various signal processing algorithms. A key problem in the realization of any digital system in integer DSP code is how to map an algorithm efficiently (in both time and space) which is typically running in floating point C code on a workstation to integer C code that takes advantage of the unique characteristics of different DSP chips. Furthermore, because of the rate of change of technology, it is essential that the conversion to DSP code occur rapidly (e.g., on the order of 3-person months) or else by the time a given algorithm is mapped to a specific DSP processor, a new (faster, cheaper) generation of DSP chips will have evolved, obsoleting the entire process. 7. Software Tools for Speech Research and Development (see Chapter 50 by Shore). The field of speech processing has become a complex one, where an investigator needs a broad range of tools to record, digitize, display, manipulate, process, store, format, analyze, and listen to speech in its different file forms and manifestations. Although it is conceivable that an individual could create a suite of software tools for an individual application, that process would be highly inefficient and would undoubtedly result in tools which were significantly less powerful than those developed in the commercial sector, such as the Entropic Signal Processing System, MATLAB, Waves, Interactive Laboratory System (ILS), or the commercial packages for TTS and speech recognition such as the Hidden Markov Model Toolkit (HTK). The material presented in this section should provide the reader with a framework for understanding the signal processing aspects of speech processing and some pointers into the literature for further investigation of this fascinating and rapidly evolving field. 1999 by CRC Press LLC c