Speech Recognition Technology

Introduction

The idea of talking to computers seems like something out of a science fiction movie. However, speech and voice recognition technologies are not new; they have been around for more than fifty years. Voice recognition (sometimes called voiceprint) is used most frequently for security identification purposes. Speech recognition is used for transaction systems and as an input method that complements the keyboard and mouse. A common example of a transaction system is speaking commands into the telephone to access voice mail or to respond to an automated telephone answering system. Prompts such as “Say or press 1” are frequently used. The use of a microphone to input text rather than keying the text is growing. These two uses of speech recognition are very different.

Speaker Independent Systems

The transaction system in the telephone illustration has to recognize the words spoken by any person who uses it; that is, it is speaker independent. To make that possible, the vocabulary that the system recognizes is very limited and usually relates to a very specific field or topic. These systems tend to be discrete systems; that is, each word spoken is a separate unit, which is preceded and followed by a pause. The vocabulary limitations and the requirement of a slight pause before and after words generally result in a high accuracy rate.

Speaker Dependent Systems

Dictating into a microphone and having the software automatically enter the dictated text is a far more complex operation because speech patterns vary dramatically among individuals. Therefore, the system has to learn how to recognize the words spoken by each speaker. Continuous speech is generally used for dictation. In continuous speech (often called natural language), words are spoken as phrases or sentences without pauses before or after them.
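
The pause requirement of a discrete system can be illustrated with a small sketch. It assumes the audio has already been reduced to one loudness value per short frame, and it treats any sufficiently long run of quiet frames as the pause that separates two spoken units. The function name, the threshold, and the minimum pause length are hypothetical values chosen for illustration; they are not taken from any product discussed in this paper, and real systems use more sophisticated endpoint detection.

```python
from typing import List, Tuple


def split_on_pauses(frame_energy: List[float],
                    silence_threshold: float = 0.1,
                    min_pause_frames: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, one per spoken unit.

    A unit ends only after `min_pause_frames` consecutive quiet frames,
    so a brief dip inside a word does not split it.
    """
    segments = []
    start = None      # frame index where the current unit began
    quiet_run = 0     # consecutive quiet frames seen since the last loud one

    for i, energy in enumerate(frame_energy):
        if energy >= silence_threshold:
            if start is None:
                start = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_pause_frames:
                # The unit ended where the quiet run began.
                segments.append((start, i - quiet_run + 1))
                start = None
                quiet_run = 0

    if start is not None:
        # Audio ended before a full pause; drop any trailing quiet frames.
        segments.append((start, len(frame_energy) - quiet_run))
    return segments


if __name__ == "__main__":
    # Made-up energies: two words separated by a three-frame pause.
    energies = [0, 0, 0.5, 0.6, 0.05, 0.7, 0, 0, 0, 0.4, 0.4, 0, 0, 0]
    print(split_on_pauses(energies))  # [(2, 6), (9, 11)]
```

Continuous speech offers no such pauses to split on, which is one reason dictation is the harder problem.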
Hardware and Software

Wise users carefully analyze both hardware and software needs prior to selecting and installing a continuous speech recognition system.

Hardware Requirements

When asked about the resources needed to run continuous speech recognition software, users often respond, “more resources and faster resources are always better.” Although the minimum requirements specified by vendors vary somewhat, most knowledgeable users recommend a personal computer with a Pentium MMX/200 or higher processor, 48–64 MB of RAM, and 2 or more GB of hard disk space. A good sound system and a good noise-canceling headset are also essential.

Currently, most continuous speech recognition software users operate on standalone personal computers. If multiple users share a computer and store speech files, such as in a classroom setting, the resource requirements may be even greater. Although the software runs on networked systems, large corporate networks present tremendous challenges. Bandwidth and layers of servers become critical issues. Supporting the systems and training users are issues not easily resolved.

Software Considerations

Software components include:

- A continuous speech recognition engine
- Dictionaries or vocabularies
- Interfaces with application software
- Algorithms for processing natural language

Four vendors of speech recognition software currently lead the market:

- L&H Voice Xpress™
- Dragon NaturallySpeaking™
- IBM ViaVoice™
- Philips FreeSpeech™

This white paper focuses on:

- L&H Voice Xpress because it is supported by Microsoft and is completely integrated into Microsoft Word.
- Dragon NaturallySpeaking because it is bundled with the Corel WordPerfect Office 2000 suite.

Common Features

Both software programs share common features. Although the terminology used to describe them differs slightly, the features themselves are very similar. Installing the software is only the beginning of the startup process.
Audio and Microphone Setup

Both programs are speaker dependent; therefore, the microphone and speakers have to be adjusted and tuned for each speaker. Correct tuning and positioning of the microphone are just as important to speech recognition as correct posture and keying techniques are to keyboarding.

Enrollment

Training the software is part of the startup process. This process consists of developing and storing a speech profile or voiceprint for each speaker. The initial enrollment requires the speaker to read into the microphone from text provided on the screen by the system. The time required varies from 30 minutes to more than an hour. When the reading has been completed, the system processes and stores the individual’s speech files. This profile is then used each time the speaker dictates text into the system. Depending on how clearly the speaker enunciates and how easily the system recognizes the words the speaker says, enrollment or training may have to be repeated several times. Training is an important way to increase accuracy.

Multimedia Tours and Tutorials

Both programs offer video tours of the software to introduce the new user to the features of the system, including how to position the microphone. Online help, including demonstrations of what to do and what not to do, also makes it easier to learn how to use the software.

System Training Features

When a speaker says new words that the system does not recognize, the speaker has to stop and train the system to recognize those words. Training consists of keying or selecting a word and then pronouncing it for the system. Both software packages also have features to increase the system’s vocabulary. The more you train the system and build the vocabulary, the more accurate your dictation will be. Often, 30 to 60 hours of training are needed to reach accuracy levels of 90 to 95 percent.

Continuous Speech Input

Users speak in a normal fashion without pauses between words.
In fact, accuracy increases when the speaker uses long phrases or whole sentences because the context helps the system to recognize and select the correct word. All systems require the speaker to say punctuation marks aloud, for example by speaking the word “period” or “comma.”

Commands

The speaker has to help the system distinguish between words that are dictated and words that are commands used to tell the system what to do. This is accomplished by pausing before a command, then dictating the entire command without a pause, and pausing again after the command has been given.

A variety of commands exist. Some commands are designed to control application software. For example, L&H Voice Xpress is completely integrated into Microsoft Word and can be used to give any command on the menu system, such as Close File or Print. Many global commands are used for editing, formatting, and navigating through a document. They are accomplished in much the same way that they would be executed from the keyboard or the mouse. For example, words are selected before a formatting characteristic such as bold, italics, or underline is applied. The same is true when dictating a command: the user tells the system to go to and select a word or words and then to apply bold to the selected text.

Editing and Formatting Documents

Documents can be edited and formatted during the dictation process or after the dictation has been completed. Using a combination of voice, keystrokes, and mouse clicks produces the most efficient editing results. Editing is a critical skill that often requires more time than dictation.

Efficiency and Effectiveness

A tremendous amount of hype exists about the productivity and accuracy of speech recognition software. Users, particularly those associated with the vendors or trainers, frequently brag about input at 160 words per minute with accuracy rates above 95 percent. Generally, they have trained the system extensively and are experienced dictators.
They rarely talk about the input source material or about the total productivity time. Most of the accuracy rate information comes from reading from written copy. The accuracy rate produced from reading is dramatically higher than the accuracy rate produced from composing and dictating to the system. The real test of accuracy is the rate achieved by composing and dictating directly to the system. Reading is an excellent technique for initial learning, but the most crucial skills that need to be learned to use speech recognition software effectively are composition and dictation skills.

Reports are available from a number of studies conducted to determine the efficiency and effectiveness of speech recognition programs. However, many of these studies used limited samples and were not conducted under stringent testing conditions. Results from independent laboratory tests indicate average accuracy rates of 87 to 91 percent, which is not acceptable for general usage. The process of editing a document with a high percentage of errors is painstaking and time consuming. Some users, however, are able to achieve 98 percent accuracy consistently.

Some laboratory tests compared keyboard input with speech recognition input. If the typist had average keying skill, the time required to produce 100 percent accurate documents by keyboard was less than the time required to accomplish the same task with speech recognition. The time that matters in the business world is the time that it takes to complete a document with 100 percent accuracy.

Some of the laboratory tests reported differences in accuracy rates based on gender in all of the speech recognition programs tested. The average accuracy rate was higher for females than for males.

Successful Applications

Some applications of speech recognition technology are more advanced than others. Professionals in the medical field have extensive experience with dictation systems and with speech recognition systems.
Large dictionaries have been built and incorporated in speech recognition programs for many medical specialties, including radiology, pathology, orthopedics, internal medicine, and emergency medicine. Special editions have also been developed for the legal profession.

Ergonomic Considerations

Speech recognition software offers tremendous potential for individuals who are keyboard impaired. Many people think speech recognition is the answer to carpal tunnel problems. However, extensive, repetitive dictation may be even more harmful to the voice than repetitive motion is to the wrists. Care needs to be taken to avoid these harmful effects. Another key concern is the noise pollution that is likely to occur now that most employees work in an open office environment.

The Future of Speech Recognition

The technology will continue to evolve and mature. Productivity and accuracy will increase, and so will general acceptance of the technology. Although it is unlikely that speech recognition will replace keyboarding in the near future, it is likely that it will complement the use of the keyboard and the mouse. Learning to use speech recognition software effectively is a very good investment of time. Learning to compose and dictate effectively may be an even better investment of time. Oral communication skills are skills for the future.
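
A note on the Efficiency and Effectiveness discussion: the point that total time to a 100 percent accurate document matters more than raw input speed can be checked with simple arithmetic. The sketch below adds the time needed to enter the text to the time needed to correct the errors. Only the 160 words per minute figure and the accuracy rates come from this paper. The function name, the 1,000-word document, the typing speed of 50 words per minute, and the seconds needed per correction are assumptions made for illustration, not measurements.

```python
def minutes_to_finished_document(word_count: int,
                                 input_wpm: float,
                                 accuracy: float,
                                 seconds_per_correction: float) -> float:
    """Estimate minutes needed to reach a fully corrected document.

    accuracy is the fraction of words entered correctly (0.0 to 1.0).
    Every wrong word is assumed to cost the same fixed correction time.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0.0 and 1.0")
    if input_wpm <= 0:
        raise ValueError("input_wpm must be positive")

    input_minutes = word_count / input_wpm
    errors = word_count * (1.0 - accuracy)
    correction_minutes = errors * seconds_per_correction / 60.0
    return input_minutes + correction_minutes


if __name__ == "__main__":
    words = 1000  # assumed document length

    # Dictation at the advertised speed but laboratory-level accuracy.
    # The 15 seconds per correction is an assumption.
    dictated = minutes_to_finished_document(words, 160, 0.90, 15)

    # Assumed average typist: 50 wpm, 98 percent accurate, 5 s per fix.
    typed = minutes_to_finished_document(words, 50, 0.98, 5)

    # Dictation by a user who consistently reaches 98 percent accuracy.
    dictated_expert = minutes_to_finished_document(words, 160, 0.98, 15)

    print(f"dictation at 90% accuracy: {dictated:.2f} min")
    print(f"typing at 98% accuracy:    {typed:.2f} min")
    print(f"dictation at 98% accuracy: {dictated_expert:.2f} min")
```

Under these assumed numbers, dictation at 90 percent accuracy takes roughly 31 minutes, typing takes roughly 22 minutes, and dictation at 98 percent accuracy takes roughly 11 minutes. The comparison is sensitive to the assumed correction times, so it illustrates the shape of the trade-off rather than predicting results for any particular user.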