Section X: Speech Processing

Richard V. Cox
AT&T Labs — Research
Lawrence R. Rabiner
AT&T Labs — Research
44 Speech Production Models and Their Digital Implementations
M. Mohan Sondhi and Juergen Schroeter
Introduction • Geometry of the Vocal and Nasal Tracts • Acoustical Properties of the Vocal and
Nasal Tracts • Sources of Excitation • Digital Implementations
45 Speech Coding
Richard V. Cox
Introduction • Useful Models for Speech and Hearing • Types of Speech Coders • Current Standards
46 Text-to-Speech Synthesis
Richard Sproat and Joseph Olive
Introduction • Text Analysis and Linguistic Analysis • Speech Synthesis • The Future of TTS
47 Speech Recognition by Machine
Lawrence R. Rabiner and B. H. Juang
Introduction • Characterization of Speech Recognition Systems • Sources of Variability of Speech
• Approaches to ASR by Machine • Speech Recognition by Pattern Matching • Connected Word
Recognition • Continuous Speech Recognition • Speech Recognition System Issues • Practical Issues
in Speech Recognition • ASR Applications
48 Speaker Verification
Sadaoki Furui and Aaron E. Rosenberg
Introduction • Personal Identity Characteristics • Vocal Personal Identity Characteristics • Basic
Elements of a Speaker Recognition System • Extracting Speaker Information from the Speech Signal
• Feature Similarity Measurements • Units of Speech for Representing Speakers • Input Modes •
Representations • Optimizing Criteria for Model Construction • Model Training and Updating •
Signal Feature and Score Normalization Techniques • Decision Process • Outstanding Issues
49 DSP Implementations of Speech Processing
Kurt Baudendistel
Software Development Targets • Software Development Paradigms • Assembly Language Basics •
Arithmetic • Algorithmic Constructs
50 Software Tools for Speech Research and Development
John Shore
Introduction • Historical Highlights • The User’s Environment (OS-Based vs. Workspace-Based)
• Compute-Oriented vs. Display-Oriented • Compiled vs. Interpreted • Specifying Operations
Among Signals • Extensibility (Closed vs. Open Systems) • Consistency Maintenance • Other
Characteristics of Common Approaches • File Formats (Data Import/Export) • Speech Databases
• Summary of Characteristics and Uses • Sources for Finding Out What is Currently Available •
Future Trends
© 1999 by CRC Press LLC
With the advent of cheap, high-speed processors, and with the ever-decreasing cost of memory, the cost of speech processing has been driven down to the
point where it can be (and has been) embedded in almost any system, from a low cost
consumer product (e.g., solid-state digital answering machines, voice controlled telephones, etc.),
to a desktop application (e.g., voice dictation of a first draft quality manuscript), to an application
embedded in a voice or data network (e.g., voice dialing, packet telephony, voice browser for the
Internet, etc.). It is the purpose of this section of the Handbook to provide discussions of several
of the key technologies in speech processing and to illustrate how the technologies are implemented
using special-purpose DSP processor chips or via standard software packages running on more conventional processors.
The broad area of speech processing can be broken down into several individual areas according
to both applications and technology. These include:
1. Speech Production Models and their Digital Implementations (see Chapter 44 by Sondhi and
Schroeter). In order to understand how the characteristics of a speech signal can be exploited
in the different application areas, it is necessary to understand the properties and constraints
of the human vocal apparatus (to understand how speech is generated by humans). It is also
necessary to understand the way in which models can be built that simulate speech production
as well as the ways in which they can be implemented as digital systems, since such models
form the basis for almost all practical speech processing systems.
2. Speech Coding (see Chapter 45 by Cox). Speech coding is the process of compressing the
information in a speech signal so as to either transmit it economically over a channel whose
bandwidth is significantly smaller than that of the uncompressed signal, or store it
economically. Speech coding is
used as the basis for most modern voice messaging and voice mail systems, for voice response
systems, for digital cellular and for satellite transmission of speech, for packet telephony,
for ISDN teleconferencing, and for digital answering machines and digital voice encryption
machines.
3. Text-to-Speech Synthesis (see Chapter 46 by Sproat and Olive). Speech synthesis is the process
of creating a synthetic replica of a speech signal so as to transmit a message from a machine
to a person, with the purpose of conveying the information in the message. Speech synthesis
is often called “text-to-speech” or TTS, to convey the idea that, in general, the input to the
system is ordinary ASCII text, and the output of the system is ordinary speech. The goal of
most speech synthesis systems is to provide a broad range of capability for having a machine
speak information (stored in the machine) to a user. Key aspects of synthesis systems are the
intelligibility and the naturalness of the resulting speech. The major applications of speech
synthesis include acting as a voice server for text-based information services (e.g., stock prices,
sports scores, flight information); providing a means for reading e-mail, or the text portions
of FAX messages over ordinary phone lines; providing a means for previewing text stored in
documents (e.g., document drafts, Internet files); and finally as a voice readout for handheld
devices (e.g., phrase book translators, dictionaries, etc.).
4. Speech Recognition by Machine (see Chapter 47 by Rabiner and Juang). Speech recognition
is the process of extracting the message information in a speech signal so as to control the
action of a machine in response to spoken commands. In a sense, speech recognition is the
complementary process to speech synthesis, and together they constitute the building blocks
of a voice dialogue system with a machine. There are many factors which influence the type
of speech recognition system that is used for different applications, including the mode of
speaking to the machine (e.g., single commands, digit sequences, fluent sentences), the size
and complexity of the vocabulary which the machine understands, the task which the machine
is asked to accomplish, the environment in which the recognition system must run, and finally
the cost of the system. Although there is a wide range of applications of speech recognition
systems, the most generic systems are simple “command-and-control” systems (with menu-like interfaces), and the most advanced systems support full voice dialogues for dictation, forms
entry, catalog ordering, reservation services, etc.
5. Speaker Verification (see Chapter 48 by Furui and Rosenberg). Speaker verification is the
process of verifying the claimed identity of a speaker for the purpose of restricting access
to information (e.g., personal or private records), networks (computer, PBX), or physical
premises. The basic problem of speaker verification is to decide whether or not an unknown
speech sample was spoken by the individual whose identity was claimed. A key aspect of any
speaker verification system is to accept the true speaker as often as possible while rejecting the
impostor as often as possible. Since these are inherently conflicting goals, all practical systems
arrive at some compromise between levels of these two types of system errors. The major
area of application for speaker verification is in access control to information, credit, banking,
machines, computer networks, private branch exchanges (PBXs), and even premises. The
concept of a “voice lock” that prevents access until the appropriate speech by the authorized
individual(s) (e.g., “Open Sesame”) is “heard” by the system is made a reality using speaker
verification technology.
6. DSP Implementations of Speech Processing (see Chapter 49 by Baudendistel). Until a few
years ago, almost all speech processing systems were implemented on low-cost fixed-point DSP
processors because of their high efficiency in realizing the computational aspects of the various
signal processing algorithms. A key problem in the realization of any digital system in integer
DSP code is how to map an algorithm, which typically runs as floating-point C code on a
workstation, efficiently (in both time and space) to integer C code that takes advantage of the
unique characteristics of different DSP chips. Furthermore, because of the rate of change of
technology, it is essential that the conversion to DSP code occur rapidly (e.g., on the order of
three person-months), or else by the time a given algorithm is mapped to a specific DSP processor,
a new (faster, cheaper) generation of DSP chips will have evolved, making the entire effort obsolete.
7. Software Tools for Speech Research and Development (see Chapter 50 by Shore). The field
of speech processing has become a complex one, where an investigator needs a broad range
of tools to record, digitize, display, manipulate, process, store, format, analyze, and listen
to speech in its different file forms and manifestations. Although it is conceivable that an
individual could create a suite of software tools for an individual application, that process
would be highly inefficient and would undoubtedly result in tools which were significantly less
powerful than those developed in the commercial sector, such as the Entropic Signal Processing
System, MATLAB, Waves, Interactive Laboratory System (ILS), or the commercial packages
for TTS and speech recognition such as the Hidden Markov Model Toolkit (HTK).
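The floating-point to integer mapping mentioned in item 6 can be illustrated with a small sketch. It is not taken from Chapter 49; it assumes the common Q15 convention (16-bit signed values representing numbers in [-1, 1)), and the helper names to_q15, q15_mul, and dot_q15 are hypothetical. Python integers stand in for the DSP's registers, so saturation is applied explicitly.

```python
# Minimal Q15 fixed-point sketch (assumed convention, hypothetical helper names).
Q15_ONE = 1 << 15            # scale factor: 1.0 corresponds to 32768
Q15_MAX = Q15_ONE - 1        # largest 16-bit signed value, 32767
Q15_MIN = -Q15_ONE           # smallest 16-bit signed value, -32768


def saturate(v: int) -> int:
    """Clamp an integer to the 16-bit signed range."""
    return max(Q15_MIN, min(Q15_MAX, v))


def to_q15(x: float) -> int:
    """Quantize a float in [-1, 1) to Q15, saturating out-of-range input."""
    return saturate(int(round(x * Q15_ONE)))


def from_q15(v: int) -> float:
    """Convert a Q15 integer back to a float for inspection."""
    return v / Q15_ONE


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values: Q30 product, round, shift back to Q15."""
    return saturate((a * b + (1 << 14)) >> 15)


def dot_q15(coeffs, samples) -> int:
    """Inner product with a wide accumulator, rounded once at the end."""
    acc = 0
    for c, s in zip(coeffs, samples):
        acc += c * s         # accumulate full-precision Q30 products
    return saturate((acc + (1 << 14)) >> 15)


if __name__ == "__main__":
    h = [to_q15(c) for c in (0.25, 0.5, 0.25)]   # illustrative filter taps
    x = [to_q15(s) for s in (0.5, 0.5, 0.5)]     # illustrative input samples
    print(from_q15(dot_q15(h, x)))
```

Accumulating the full-precision products and rounding once at the end mirrors the wide accumulators found on many fixed-point DSP chips; rounding after every multiply would be simpler but would accumulate more quantization error.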
The material presented in this section should provide the reader with a framework for understanding the signal processing aspects of speech processing and some pointers into the literature for further
investigation of this fascinating and rapidly evolving field.
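As one concrete illustration of the trade-offs discussed in this section, the compromise between the two error types in speaker verification (item 5) can be sketched as a threshold on a similarity score. This is a minimal sketch under stated assumptions: the function name error_rates is hypothetical, a claim is accepted when its score is at least the threshold, and the scores below are invented purely for illustration, not measured data.

```python
# Sketch of the accept/reject trade-off in speaker verification.
# Assumption: higher score means a better match to the claimed speaker.

def error_rates(true_scores, impostor_scores, threshold):
    """Return (false_rejection_rate, false_acceptance_rate) at a threshold.

    A claim is accepted when its score is >= threshold.
    """
    false_rejections = sum(s < threshold for s in true_scores)
    false_acceptances = sum(s >= threshold for s in impostor_scores)
    return (false_rejections / len(true_scores),
            false_acceptances / len(impostor_scores))


if __name__ == "__main__":
    # Invented scores, for illustration only.
    true_scores = [0.9, 0.8, 0.75, 0.6, 0.4]
    impostor_scores = [0.7, 0.5, 0.3, 0.2, 0.1]
    for t in (0.45, 0.65, 0.95):
        fr, fa = error_rates(true_scores, impostor_scores, t)
        print(f"threshold={t:.2f}  false rejection={fr:.2f}  false acceptance={fa:.2f}")
```

Raising the threshold lowers the false acceptance rate at the cost of more false rejections, and lowering it does the reverse, which is why a practical system has to settle on an operating point between the two.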