Introduction

The availability of inexpensive, yet powerful, microprocessors and digital signal
processing (DSP) chips has enabled technology manufacturers to introduce voice
recognition (VR) technology into products such as biometric sensors, hands-free dictation
devices, software interaction products and robotics control.
This paper focuses on common commercial applications of voice recognition, as
well as the underlying signal processing and typical implementations of VR technology.
Commercial Applications of Voice Recognition
In terms of commercial market sales, server-based call center voice recognition
accounts for roughly 3/5 of the VR market [1]. In this configuration, high-speed central
servers running VR software allow callers to dictate information, such as addresses,
bank accounts, and extension numbers. Previously, such tasks were handled by a human
operator or a touch-tone telephone keypad. Voice recognition software has allowed call centers
to decrease both average wait times [1] and call costs [1].
Comprising approximately 1/5 of the VR market [1], biometric security systems
also utilize voice recognition technology. In this application, speaker and pattern
recognition software analyzes and compares a user’s voice to a stored sampling of
authorized voices. As a security measure, text-independent, multi-lingual voice
recognition demonstrates accuracy as great as 98.9% under ideal conditions, comparable
to fingerprint identification [2].
Voice recognition technology has also facilitated the development of hands-free
voice-command devices. Already widespread in Bluetooth headsets, luxury automobiles
and navigation devices, sales of voice-controlled devices are expected to quadruple by
2010 [1]. These applications are characterized by the use of 16- or 32-bit
microprocessors embedded with existing VR software [3]. While the simplest voice
command devices rely on a database of user-given commands, the most advanced
technology, known as Speaker Independent, allows full “out of the box” functionality for
untrained users, regardless of environmental noise, regional accent, or speaker intonation
[4].
Underlying Technology of Voice Recognition
Voice recognition relies on a mathematical algorithm known as the Fast
Fourier Transform (FFT) [5]. In digital signal processing, an FFT efficiently
decomposes a sampled, discrete-time waveform (such as a digitized speech sample) into a
sum of simple sinusoidal components, yielding a finite series of discrete
frequency-domain values [5]. Application of the Fast Fourier Transform algorithm allows
microprocessors to quickly convert digitized sound into a frequency-domain form that is
easily analyzed and manipulated. When a time-domain signal is needed after processing,
an inverse FFT is performed on the frequency-domain data to transform it back into a waveform [5].
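As a rough illustration of the transform described above, the following Python sketch implements a textbook radix-2 FFT and its inverse using only the standard library. The eight-sample test tone and the names fft, ifft, and peak_bin are assumptions made for this example; they are not taken from any particular VR product or from the cited references.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 FFT; the input length must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])   # transform of even-indexed samples
    odd = fft(samples[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(spectrum):
    """Inverse FFT using the identity ifft(X) = conj(fft(conj(X))) / N."""
    n = len(spectrum)
    transformed = fft([x.conjugate() for x in spectrum])
    return [x.conjugate() / n for x in transformed]


# Hypothetical input: 8 samples holding exactly 2 cycles of a cosine.
N = 8
tone = [math.cos(2 * math.pi * 2 * i / N) for i in range(N)]

spectrum = fft(tone)
magnitudes = [abs(x) for x in spectrum]

# Search only bins 0..N/2, since a real signal has a mirrored spectrum.
peak_bin = max(range(N // 2 + 1), key=lambda k: magnitudes[k])

# The inverse transform should reproduce the original samples.
recovered = [x.real for x in ifft(spectrum)]
```

Here peak_bin identifies the frequency bin holding most of the tone's energy, and recovered should match tone up to floating-point rounding. A production system would more likely use an optimized, fixed-point FFT routine than a recursive one.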
Following the FFT, further signal processing must be performed for speaker
independent voice recognition applications. Modern VR systems are generally based on a
statistical voice recognition process known as the Hidden Markov Model (HMM) [6].
This process decomposes human speech into phonemes, the smallest (~10 milliseconds)
meaningful fragments of speech [6]. By decomposing a speech sample into a string of
phonemes, comparing them to a database of known commands, and finding the most
probable match, the HMM is able to accurately recognize human speech regardless of
speaker accent or intonation [7].
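The idea of "finding the most probable match" can be sketched with the forward algorithm, which computes how likely an observation sequence is under a given HMM. The sketch below is a minimal illustration under stated assumptions: the two command models ("yes" and "no"), their probabilities, and the three-symbol observation alphabet are all invented for the example, and real systems use far larger models with continuous acoustic features.

```python
def forward_likelihood(obs, start, trans, emit):
    """Return P(obs | model) for a discrete HMM via the forward algorithm.

    obs   : non-empty list of observation symbol indices
    start : start[s] is the probability of beginning in state s
    trans : trans[p][s] is the probability of moving from state p to s
    emit  : emit[s][o] is the probability of state s emitting symbol o
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state, left-to-right models for two commands.
LEFT_TO_RIGHT = [[0.6, 0.4], [0.0, 1.0]]
MODELS = {
    "yes": {
        "start": [1.0, 0.0],
        "trans": LEFT_TO_RIGHT,
        "emit": [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
    },
    "no": {
        "start": [1.0, 0.0],
        "trans": LEFT_TO_RIGHT,
        "emit": [[0.1, 0.1, 0.8], [0.1, 0.8, 0.1]],
    },
}


def recognize(obs):
    """Pick the command whose model assigns obs the highest likelihood."""
    return max(
        MODELS,
        key=lambda name: forward_likelihood(
            obs, MODELS[name]["start"], MODELS[name]["trans"], MODELS[name]["emit"]
        ),
    )


best = recognize([0, 0, 1, 1])
```

Scoring every stored model and keeping the best one is only practical for small vocabularies; larger systems typically share phoneme-level models across words and search them jointly.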
Implementation of Voice Recognition
Voice recognition technology for voice command devices typically follows a prepackaged, “black box” approach. VR chips are widely available in a compact Dual In-Line Package (DIP), coupled with an 8-, 16-, or 32-bit embedded microprocessor [4]. Such
an approach saves time by allowing manufacturers to avoid developing complex DSP
algorithms.
“Black box” VR chips typically begin with a 16-bit Analog-to-Digital Converter
(ADC) that converts the analog sound wave input from the microphone to
a 16-bit digital signal. A Fast Fourier Transform is then performed by an internal
microprocessor to convert the digital signal to the frequency domain, allowing for more
efficient signal processing [4].
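The 16-bit conversion step can be pictured as mapping each analog level onto one of the signed 16-bit integer codes. The following Python sketch shows this under stated assumptions: the ±1.0 V input range, the 8 kHz sample rate, and the function names are hypothetical and chosen only for illustration.

```python
import math

FULL_SCALE_VOLTS = 1.0   # assumed ADC input range: -1.0 V to +1.0 V
MAX_CODE = 32767         # largest signed 16-bit value (2**15 - 1)


def quantize_16bit(voltage, full_scale=FULL_SCALE_VOLTS):
    """Clip an analog level to the input range and map it to a 16-bit code."""
    clipped = max(-full_scale, min(full_scale, voltage))
    return round(clipped / full_scale * MAX_CODE)


def sample_tone(freq_hz, sample_rate_hz, count):
    """Simulate sampling a full-scale sine wave and quantizing each sample."""
    return [
        quantize_16bit(math.sin(2 * math.pi * freq_hz * i / sample_rate_hz))
        for i in range(count)
    ]


# Hypothetical input: a 1 kHz tone sampled at 8 kHz (8 samples = 1 cycle).
codes = sample_tone(1000, 8000, 8)
```

The resulting list of integer codes is the kind of discrete-time signal that would then be passed to the FFT stage.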
Following the FFT, VR chips typically follow one of two approaches, depending
on the type of chip. The simplest form of voice recognition is known as Speaker Dependent (SD). While such chips avoid further costly and time-consuming digital signal
processing, a period of “training” is required before the chip is fully functional [3]. This
usually requires the user to repeat several command phrases, which are then stored in
internal memory. SD chips then access this database following the FFT and find the
closest match. The SD “black box” chip then typically performs a Digital-to-Analog
Conversion and outputs the analog signal to the appropriate pin.
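The train-then-match behavior of an SD chip can be sketched as a nearest-template lookup. This is a minimal illustration, not a description of any specific chip: the class name, the Euclidean distance measure, the rejection threshold, and the three-element feature vectors (standing in for FFT-derived features) are all assumptions made for the example.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class SpeakerDependentRecognizer:
    """Stores a user's training utterances and matches new input to them."""

    def __init__(self, reject_above=1.0):
        self.templates = {}               # command -> list of feature vectors
        self.reject_above = reject_above  # assumed rejection threshold

    def train(self, command, features):
        """Store one training repetition of a command phrase."""
        self.templates.setdefault(command, []).append(list(features))

    def recognize(self, features):
        """Return the closest stored command, or None if nothing is close."""
        best_command, best_distance = None, float("inf")
        for command, examples in self.templates.items():
            for example in examples:
                d = distance(features, example)
                if d < best_distance:
                    best_command, best_distance = command, d
        return best_command if best_distance <= self.reject_above else None


# Hypothetical training data for two commands.
recognizer = SpeakerDependentRecognizer()
recognizer.train("lights on", [0.9, 0.1, 0.0])
recognizer.train("lights off", [0.1, 0.9, 0.0])

match = recognizer.recognize([0.8, 0.2, 0.1])
```

Returning None for distant inputs lets the device ignore speech that resembles no trained command rather than forcing a wrong match.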
Speaker Independent (SI) chips, while costlier and slightly slower than SD chips,
permit immediate functionality for untrained users. Also, SI chips often support and
recognize multiple languages, and are usually unaffected by regional accents, background
noise, or speaker intonation [7]. This is made possible by further signal processing
of the Fourier-transformed signal, typically carried out by an internal 32-bit
DSP chip [7]. The DSP chip applies the Hidden Markov Model to the transformed signal,
statistically calculating the most likely phoneme match, and matching it to an internal
database of words. This database is typically stored in 1 MB of on-chip flash memory.
External memory allows a larger recognition vocabulary and
multilingual support [2]. Manufacturers typically utilize one or more additional
microprocessors to perform the desired action with the VR chip output.
Conclusion
Voice recognition technology sales are already a one-billion-dollar annual market
[1], and will continue to grow as microprocessors become more powerful and less
expensive. Such technology enables hands-free interaction with automobiles, robotics,
customer service call centers and biometric security systems. As VR technology
advances, it will continue to allow a new degree of interaction and control with electronic
and mechanical devices.
References
[1] I. Lamont, “Speech recognition technology will hear you now,” Popular Science, June, pp. 26-27, 2005.
[2] D. A. Reynolds, W. Campbell, and A. Adami, “The 2004 MIT Lincoln Laboratory speaker recognition system,” presented at IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Philadelphia, PA, Mar. 2005.
[3] J. P. Campbell, “Speaker Recognition: A Tutorial,” Proc. IEEE, vol. 64, pp. 275-291, 2002.
[4] R. Schwartz, S. Roucos, and M. Berouti, “The application of probability density estimation to text independent speaker identification,” in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Paris, France, 1982, pp. 1649–1652.
[5] L. Rabiner and R. Schafer, Digital Processing of Speech Signals, A. Oppenheim, Series Ed. Englewood Cliffs, NJ: Prentice-Hall, 1978.
[6] D. Reynolds and R. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. Speech Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.
[7] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, pp. 257–286, Feb. 1989.