Introduction

The availability of inexpensive yet powerful microprocessors and digital signal processing (DSP) chips has enabled technology manufacturers to introduce voice recognition (VR) technology into products such as biometric sensors, hands-free dictation devices, software interaction products, and robotics control. This paper focuses on common commercial applications of voice recognition, as well as the underlying signal processing and typical implementations of VR technology.

Commercial Applications of Voice Recognition

In terms of commercial market sales, server-based call center voice recognition accounts for roughly three-fifths of the VR market [1]. In this configuration, high-speed central servers running VR software allow callers to dictate information such as addresses, bank account numbers, and extension numbers, tasks previously handled by a human operator or a touch-tone telephone pad. Voice recognition software has allowed call centers to decrease both average wait times and call costs [1].

Biometric security systems, comprising approximately one-fifth of the VR market [1], also utilize voice recognition technology. In this application, speaker and pattern recognition software analyzes a user's voice and compares it to stored samples of authorized voices. As a security measure, text-independent, multilingual voice recognition demonstrates accuracy as high as 98.9% under ideal conditions, comparable to fingerprint identification [2].

Voice recognition technology has also facilitated the development of hands-free voice-command devices. Already widespread in Bluetooth headsets, luxury automobiles, and navigation devices, sales of voice-controlled devices are expected to quadruple by 2010 [1]. These applications are characterized by the use of 16- or 32-bit microprocessors embedded with existing VR software [3].
While the simplest voice-command devices rely on a database of user-recorded commands, the most advanced technology, known as Speaker Independent, allows full "out of the box" functionality for untrained users, regardless of environmental noise, regional accent, or speaker intonation [4].

Underlying Technology of Voice Recognition

Voice recognition is made possible by an efficient algorithm known as the Fast Fourier Transform (FFT) [5]. In digital signal processing, the FFT computes the discrete Fourier transform of a finite, sampled waveform (such as a speech frame), decomposing it into a sum of simple sinusoidal components [5]. Application of the Fast Fourier Transform allows microprocessors to quickly and efficiently convert digitized sound signals into frequency-domain representations that are easily analyzed. Following any signal processing, an inverse FFT can then be performed on the frequency-domain signal to transform it back into a time-domain waveform [5].

Following the FFT, further signal processing must be performed for speaker-independent voice recognition applications. Modern VR systems are generally based on a statistical voice recognition process known as the Hidden Markov Model (HMM) [6]. This process decomposes human speech into phonemes, the smallest (roughly 10 millisecond) meaningful fragments of speech [6]. By decomposing a speech sample into a string of phonemes, comparing the string to a database of known commands, and finding the most probable match, the HMM is able to recognize human speech accurately regardless of speaker accent or intonation [7].

Implementation of Voice Recognition

Voice recognition technology for voice-command devices typically follows a prepackaged, "black box" approach. VR chips are widely available in a compact Dual In-Line Package (DIP), coupled with an 8-, 16-, or 32-bit embedded microprocessor [4]. Such an approach saves time by allowing manufacturers to avoid developing complex DSP algorithms.
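The FFT decomposition and inverse transform described above can be illustrated with a short NumPy sketch. The 440 Hz tone and 8 kHz sampling rate are invented stand-ins for a real speech frame, not values from any particular VR system.

```python
import numpy as np

# Sample a 440 Hz tone at 8 kHz as a stand-in for a digitized speech frame.
fs = 8000                        # sampling rate (Hz)
t = np.arange(0, 0.1, 1 / fs)    # 100 ms of samples
signal = np.sin(2 * np.pi * 440 * t)

# The FFT decomposes the sampled waveform into sinusoidal components.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), 1 / fs)

# The dominant frequency bin sits at 440 Hz.
peak = freqs[np.argmax(np.abs(spectrum))]

# An inverse FFT transforms the spectrum back into the time-domain waveform.
recovered = np.fft.irfft(spectrum, n=len(signal))
```

Real VR front ends apply this transform to short overlapping frames of speech rather than a single pure tone, but the decompose-process-invert cycle is the same.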
"Black box" VR chips typically begin with a 16-bit Analog-to-Digital Converter (ADC), which converts the analog sound-wave input from the microphone into a 16-bit digital signal. A Fast Fourier Transform is then performed by an internal microprocessor to convert the digital signal into the frequency domain, allowing for more efficient signal processing [4].

Following the FFT, VR chips typically follow one of two approaches, depending on the type of chip. The simplest form of voice recognition is known as Speaker Dependent (SD). While such chips avoid further costly and time-consuming digital signal processing, a period of "training" is required before the chip is fully functional [3]. Training usually requires the user to repeat several command phrases, which are then stored in internal memory. SD chips access this database following the FFT and find the most appropriate match. The SD "black box" chip then typically performs a digital-to-analog conversion and outputs the analog signal to the appropriate pin.

Speaker Independent (SI) chips, while costlier and slightly slower than SD chips, permit immediate functionality for untrained users. SI chips also often support and recognize multiple languages, and are usually unaffected by regional accents, background noise, or speaker intonation [7]. This is made possible by further signal processing performed on the Fourier-transformed signal, typically by an internal 32-bit DSP chip [7]. The DSP chip applies the Hidden Markov Model to the transformed signal, statistically calculating the most likely phoneme sequence and matching it against an internal database of words. This database is typically a 1 MB on-chip flash memory device; external memory allows greater word-recognition capability and multilingual support [2]. Manufacturers typically utilize one or more additional microprocessors to perform the desired action with the VR chip's output.
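The HMM "most probable match" step that the DSP chip performs can be sketched with the Viterbi algorithm. The two phoneme labels, the probability tables, and the coarse observation classes below are all invented for illustration; production systems use far larger models trained on real speech.

```python
import numpy as np

# Toy HMM: two hypothetical phoneme states; observations are two
# coarse spectral classes (0 or 1) standing in for FFT-derived features.
states = ["s", "iy"]                     # invented phoneme labels
start = np.array([0.6, 0.4])             # initial state probabilities
trans = np.array([[0.7, 0.3],            # P(next state | current state)
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],             # P(observation | state)
                 [0.2, 0.8]])

def viterbi(obs):
    """Return the most probable phoneme sequence for an observation list."""
    prob = start * emit[:, obs[0]]       # best score ending in each state
    back = []                            # backpointers for path recovery
    for o in obs[1:]:
        # scores[i, j] = best path ending in i, then moving to j emitting o
        scores = prob[:, None] * trans * emit[None, :, o]
        back.append(np.argmax(scores, axis=0))
        prob = np.max(scores, axis=0)
    path = [int(np.argmax(prob))]        # trace the best path backwards
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return [states[i] for i in path]

decoded = viterbi([0, 0, 1, 1])          # → ["s", "s", "iy", "iy"]
```

The decoded phoneme string would then be compared against the chip's word database, exactly as the HMM description above outlines.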
Conclusion

Voice recognition technology sales already constitute a one-billion-dollar annual market [1], and will continue to grow as microprocessors become more powerful and less expensive. Such technology enables hands-free interaction with automobiles, robotics, customer service call centers, and biometric security systems. As VR technology advances, it will continue to allow a new degree of interaction with and control of electronic and mechanical devices.

References

[1] I. Lamont, "Speech recognition technology will hear you now," Popular Science, pp. 26-27, June 2005.
[2] D. A. Reynolds, W. Campbell, and A. Adami, "The 2004 MIT Lincoln Laboratory speaker recognition system," presented at the IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Philadelphia, PA, Mar. 2005.
[3] J. P. Campbell, "Speaker recognition: A tutorial," Proc. IEEE, vol. 64, pp. 275-291, 2002.
[4] R. Schwartz, S. Roucos, and M. Berouti, "The application of probability density estimation to text independent speaker identification," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Paris, France, 1982, pp. 1649-1652.
[5] L. Rabiner and R. Schafer, Digital Processing of Speech Signals, A. Oppenheim, Series Ed. Englewood Cliffs, NJ: Prentice-Hall, 1978.
[6] D. Reynolds and R. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech Audio Processing, vol. 3, no. 1, pp. 72-83, 1995.
[7] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, pp. 257-286, Feb. 1989.