How It Works: Speech Recognition

Speech recognition software is better and less expensive than ever. Find out how your words go from voice to text on the screen.

Stan Miastkowski - Apr 14, 2000 4:30 pm

Speech recognition: a technology that transforms spoken words into alphanumeric text and navigational commands that can be recognized by a PC.

For years, speech recognition has been the poster child for technology that never lived up to its promise. Only three years ago, the products were expensive, inaccurate, and hard to use. That's changing. Fast PCs and ingenious software improvements mean that speech recognition technology finally offers real benefits. And it's appearing in places you might not have expected, including your mobile phone. Want to compose e-mail or surf the Web? All you'll have to do is talk.

Here's what you need to know:

* You can dictate text into applications and control your desktop with up to 95 percent accuracy.
* Speech recognition software requires a fast CPU, plenty of RAM, a good microphone, and a good sound card.
* New developments let you take speech recognition to the Internet and even beyond your PC.

A computer doesn't speak your language, so it must transform your words into something it can understand. A microphone converts your voice into an analog signal and feeds it to your PC's sound card. An analog-to-digital converter takes the signal and converts it to a stream of digital data (ones and zeros). Then the software goes to work.

While each of the leading speech recognition companies has its own proprietary methods, the two primary components of speech recognition are common across products. The first piece, called the acoustic model, analyzes the sounds of your voice and converts them to phonemes, the basic elements of speech. The English language contains approximately 50 phonemes.

Here's how it breaks down your voice: First, the acoustic model removes noise and unneeded information such as changes in volume.
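The capture and cleanup steps described so far can be sketched roughly as follows. This is a minimal illustration, not any vendor's method: the 16-kHz sample rate, the 16-bit depth, the 440-Hz test tone standing in for a voice, and the function names are all assumptions made for the example.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed; the article gives no figure)
BIT_DEPTH = 16        # bits per sample (assumed)


def analog_signal(t):
    """Stand-in for the microphone voltage at time t (seconds), within [-1, 1]."""
    return 0.25 * math.sin(2 * math.pi * 440.0 * t)


def digitize(signal, duration_s, sample_rate=SAMPLE_RATE, bit_depth=BIT_DEPTH):
    """Mimic an analog-to-digital converter: sample at fixed intervals and
    round each reading to the nearest integer level."""
    max_level = 2 ** (bit_depth - 1) - 1  # 32767 for 16-bit audio
    count = round(duration_s * sample_rate)
    return [round(signal(i / sample_rate) * max_level) for i in range(count)]


def normalize_volume(samples, target_peak=1.0):
    """Scale the samples so the loudest one reaches target_peak, which removes
    overall loudness differences between recordings."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return [0.0] * len(samples)
    return [s * target_peak / peak for s in samples]


samples = digitize(analog_signal, duration_s=0.01)
normalized = normalize_volume(samples)
print(len(samples), "samples; loudest after normalizing:",
      max(abs(x) for x in normalized))
```

Real products also filter out background noise, which this sketch leaves out.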
Then, using mathematical calculations, it reduces the data to a spectrum of frequencies (the pitches of the sounds), analyzes the data, and converts the words into digital representations of phonemes.

[Figure: a sample sentence broken down into phonemes]

Now the second major component of speech recognition software, the language model, kicks in. The language model analyzes the content of your speech. It compares the combinations of phonemes to the words in its digital dictionary, a huge database of the most common words in the English language. Most of today's packages come with dictionaries containing about 150,000 words. In theory, the language model quickly decides which words you said and displays them on the screen.

Unfortunately, the English language complicates things. For example, "there," "their," and "they're" all sound the same. A key to the power of today's speech recognition is its use of trigrams, three-word sequences that let the software judge a word by the context in which it is used. In many cases, the software can recognize a word by looking at the two words that come before it. If you say, "let's go there," for example, the "let's go" helps the software decide to use "there" instead of "their."

Speech recognition packages also tune themselves to the individual user. The software customizes itself based on your voice, your unique speech patterns, and your accent. To improve dictation accuracy, it creates a supplementary dictionary of the words you use.

Speak and You Shall Be Heard

Dragon Systems, IBM, Lernout & Hauspie (L&H), and Philips are the major speech recognition companies in the PC arena. However, on March 28 L&H announced an agreement to purchase Dragon Systems. The company says it will continue to offer both product lines for the immediate future, which means L&H products will account for a large majority of speech recognition software sales.
According to IDC analysts, Dragon Systems holds about 60 percent of the market, with IBM and L&H vying for second place.

Dragon Systems, L&H, IBM, and Philips each offer basic packages that cost about $50. More sophisticated versions from Dragon, L&H, and IBM have larger dictionaries and more extensive application support, and cost between $200 and $250.

Speech recognition's complexity pushes the limits of PC processing power. Although most packages will work with a 200-MHz Pentium, a 300-MHz or faster chip dramatically improves performance. New chips such as the Pentium III and the Athlon satisfy the applications' demand for power even better, and many high-end packages can take advantage of the PIII's multimedia extensions. And the more RAM, the better: Consider 64MB a practical minimum, with 128MB providing substantial improvements.

Most speech packages come with a basic headset microphone, but a better one from a third party can improve recognition. Andrea, Plantronics, and VXI sell a variety of headset microphones ranging in price from $30 to $150.

The quality of your PC's sound card is also crucial. Cheap models won't cut it because they produce distorted, low-quality output. While standard 16-bit sound cards work, a high-quality card that costs $100 to $150 will offer better performance. Or you could try Dragon Systems' $80 USB headset, which bypasses the sound card entirely (thanks to its built-in digital signal processor) and works well with notebooks.

Beyond Word Processing

Most of today's speech recognition packages also allow voice control of many Windows applications (find out from the vendor which programs the recognition software works with). The packages usually do this by converting spoken words into the appropriate text or commands and sending them to the application. Applications such as Word or Excel look for standard commands, and whether those commands come from a keyboard or your mouth doesn't matter.
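The routing idea described above, where a recognized phrase becomes either a command or plain text before it reaches the application, might look something like this minimal sketch. The phrase table, the keystroke names, and the route function are hypothetical and do not reflect any vendor's actual interface.

```python
# Hypothetical table mapping spoken phrases to keystrokes that the target
# application already understands from the keyboard.
COMMANDS = {
    "save file": "Ctrl+S",
    "bold that": "Ctrl+B",
    "select all": "Ctrl+A",
}


def route(recognized_phrase):
    """Return ("command", keystroke) for a known command phrase, or
    ("text", phrase) when the phrase should be typed as dictation."""
    key = recognized_phrase.strip().lower()
    if key in COMMANDS:
        return ("command", COMMANDS[key])
    return ("text", recognized_phrase)


for phrase in ["Save File", "Dear Stan, thanks for the article."]:
    kind, payload = route(phrase)
    print(kind, "->", payload)
```

Because the application receives only the resulting keystroke or text, it cannot tell a spoken command from a typed one.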
In addition, most speech recognition packages work with your browser, allowing you to "voice surf" the Web.

Voice surfing is just the start of what you'll be able to do. Dragon and L&H now offer portable digital voice recorders that download recordings to your PC when you get back to the office; your PC's speech recognition software transcribes your notes directly from the recorder.

Analysts say portable devices such as Web-enabled mobile phones, which don't have standard keyboards, are next on the horizon. Rather than having full-fledged speech recognition, these devices will be tuned to a limited range of specific applications, such as getting stock updates.

For desktop PCs, the next major leap is three to five years away, when technologies such as natural language processing and artificial intelligence come to the consumer. Natural language processing analyzes the context of a word by looking at a whole sentence instead of a few words, resulting in greater accuracy. Even more sophisticated (and perhaps frightening), artificial intelligence will allow computers to understand what you mean instead of just what you say. Speech packages will hold a discussion with you and will analyze the emotional aspects of your voice.

Source: http://www.pcworld.com/article/162762/how_it_works_speech_recognition.html

Further Reading: http://electronics.howstuffworks.com/speech-recognition.htm