Speech Recognition—An Evolving Computer Input Technology

Speech Recognition Technology
Introduction
The idea of talking to computers seems like something out of a science fiction movie. However, speech
and voice recognition technologies are not new; they have been around for more than fifty years. Voice
recognition (sometimes called voiceprint) is used most frequently for security identification purposes.
Speech recognition is being used for transaction systems and as input complementing the use of the
keyboard and mouse. A common example of using a transaction system is speaking commands into the
telephone to access voice mail or in responding to automated telephone answering systems. Commands
such as “Say or press 1” are frequently used. The use of a microphone to input text rather than keying
the text is growing. These two uses of speech recognition are very different.
Speaker Independent Systems
The transaction system in the telephone illustration has to recognize the words spoken by any person
who uses it; that is, it is speaker independent. Therefore, the vocabulary that the system recognizes is
very limited and usually relates to a very specific field or topic. These systems tend to be discrete
systems; that is, each word spoken is a separate unit, which is preceded and followed by a pause. The
vocabulary limitations and the requirement of a slight pause before and after words generally result in a
high accuracy rate.
Speaker Dependent Systems
Dictating into a microphone and having the software automatically enter the text dictated is a far more
complex operation because speech patterns vary dramatically among individuals. Therefore, the system
has to learn how to recognize the words spoken by each speaker; that is, it is speaker dependent.
Continuous speech is generally used for dictation. In continuous speech (often called natural language),
words are spoken as phrases or sentences without pauses before or after them.
Hardware and Software
Wise users carefully analyze both hardware and software needs prior to selecting and installing a
continuous speech recognition system.
Hardware Requirements
When asked about the resources needed to run continuous speech recognition software, users often
respond, “more resources and faster resources are always better.” Although the minimum requirements
specified by vendors vary somewhat, most knowledgeable users recommend a personal computer with a
Pentium MMX/200 or higher processor, 48–64 MB of RAM and 2 or more GB of hard disk space. A
good sound system and a good noise-canceling headset are also essential.
Currently, most continuous speech recognition software users operate on standalone personal computers.
If multiple users share a computer and store speech files, such as in a classroom setting, the resource
requirements may be even greater. Although the software runs on networked systems, large corporate
networks present tremendous challenges. Bandwidth and layers of servers become critical issues.
Supporting the systems and training users are issues not easily resolved.
Software Considerations
Software components include:
• A continuous speech recognition engine
• Dictionaries or vocabularies
• Interfaces with application software
• Algorithms for processing natural language
Four vendors of speech-recognition software currently lead the market:
L&H Voice Xpress™
Dragon NaturallySpeaking™
IBM ViaVoice™
Philips FreeSpeech™
This white paper focuses on:
• L&H Voice Xpress because it is supported by Microsoft and is completely integrated into
Microsoft Word.
• Dragon NaturallySpeaking because it is bundled with the Corel WordPerfect Office 2000
suite.
Common Features
Both software programs share common features. Although slightly different terminology may be used
to describe the features, the features are very similar. Installing the software is only the beginning of the
startup process.
Audio and Microphone Setup
Both programs are speaker dependent; therefore, the microphone and speakers have to be adjusted and
tuned for each speaker. Correct tuning and positioning of the microphone are just as important to speech
recognition as correct posture and keying techniques are to keyboarding.
Enrollment
Training the software is part of the startup process. This process consists of developing and storing a
speech profile or voice print for each speaker. The initial enrollment requires the speaker to read into
the microphone from text provided on the screen by the system. The amount of time varies from 30
minutes to more than an hour. When the reading has been completed, the system then processes and
stores the individual’s speech files. This profile is then used each time the speaker dictates text into the
system. Depending on how clearly the speaker enunciates and how easily the system recognizes the
words the user says, enrollment or training may have to be repeated several times. Training is an
important way to increase accuracy.
Multimedia Tours and Tutorials
Both programs offer video tours of the software to introduce the new user to the features of the system
including how to position the microphone. Online help, including demonstrations of what to do and
what not to do, also makes it easier to learn how to use the software.
System Training Features
When a speaker says new words that the system does not recognize, the speaker has to stop and train the
system to recognize those words. Training consists of keying or selecting a word and then pronouncing
it for the system. Both software packages also have features to increase the system’s vocabulary. The
more you train the system and build the vocabulary, the more accurate your dictation will be; 30 to 60
hours of training are often needed to reach accuracy levels of 90 to 95 percent.
Continuous Speech Input
Users speak in a normal fashion without pauses between words. In fact, accuracy increases when the
speaker uses long phrases or whole sentences because the context helps the system to recognize and
select the correct word. All systems require the speaker to use words to indicate punctuation marks.
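The spoken-punctuation requirement can be illustrated with a short sketch. It is only an illustration: the function name render_dictation and the small table of spoken forms are assumptions made for this example, not the vocabulary of any product discussed here.

```python
# Hypothetical mapping from spoken punctuation words to symbols.
# Real products define their own, much larger, sets of spoken forms.
SPOKEN_PUNCTUATION = {
    "period": ".",
    "comma": ",",
    "question mark": "?",
    "exclamation point": "!",
    "colon": ":",
    "semicolon": ";",
}


def render_dictation(words):
    """Turn a list of recognized words into text, attaching punctuation."""
    out = []
    i = 0
    while i < len(words):
        one = words[i].lower()
        two = " ".join(words[i:i + 2]).lower()
        if i + 1 < len(words) and two in SPOKEN_PUNCTUATION:
            symbol, step = SPOKEN_PUNCTUATION[two], 2   # two-word form
        elif one in SPOKEN_PUNCTUATION:
            symbol, step = SPOKEN_PUNCTUATION[one], 1   # one-word form
        else:
            symbol, step = None, 1                      # ordinary word
        if symbol is None:
            out.append(words[i])
        elif out:
            out[-1] += symbol   # attach the mark to the previous word
        else:
            out.append(symbol)
        i += step
    return " ".join(out)


spoken = "Is the report ready question mark Yes comma it is period".split()
print(render_dictation(spoken))
```

A table this simple cannot tell the punctuation mark “period” from the ordinary word “period”; handling that kind of ambiguity is one reason context matters to a recognizer.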
Commands
The speaker has to help the system distinguish between words that are dictated and words that are
commands used to tell the system what to do. This is accomplished by pausing before a command, then
dictating the entire command without a pause, and pausing after the command has been given. A variety
of commands exist. Some commands are designed to control application software. For example, L&H
Voice Xpress is completely integrated into Microsoft Word and can be used to give any command on the
menu system, such as Close File or Print.
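The pause rule described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the Segment type, the half-second threshold, and the two-entry command list (taken from the Close File and Print examples) are hypothetical and do not describe how any vendor’s engine actually works.

```python
from dataclasses import dataclass

# Hypothetical values: real products choose their own pause length
# and ship far larger command sets.
MIN_PAUSE_SECONDS = 0.5
KNOWN_COMMANDS = {"close file", "print"}


@dataclass
class Segment:
    text: str            # words spoken without an internal pause
    pause_before: float  # silence before the segment, in seconds
    pause_after: float   # silence after the segment, in seconds


def classify(segment):
    """Return 'command' or 'dictation' for one recognized segment."""
    bracketed = (segment.pause_before >= MIN_PAUSE_SECONDS
                 and segment.pause_after >= MIN_PAUSE_SECONDS)
    if bracketed and segment.text.lower() in KNOWN_COMMANDS:
        return "command"
    return "dictation"


print(classify(Segment("Print", 0.8, 0.7)))                    # pauses on both sides
print(classify(Segment("please print the report", 0.1, 0.1)))  # spoken in running text
```

Under this rule the word “print” spoken in the middle of a sentence stays in the document as text, while the same word set off by pauses is treated as an instruction.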
Many global commands are used for editing, formatting, and navigating through a document. They are
accomplished in much the same way that they would be executed from the keyboard or the mouse. For
example, words are selected before a formatting characteristic such as bold, italics, or underline is
applied. The same is true with dictating a command. The user tells the system to go to and select a
word or words and apply bold to the selected text.
Editing and Formatting Documents
Documents can be edited and formatted during the dictation process or after the dictation has been
completed. Using a combination of voice, keystrokes, and mouse clicks produces the most efficient
editing results. Editing is a critical skill that often requires more time than dictation.
Efficiency and Effectiveness
A tremendous amount of hype exists about the productivity and accuracy of speech recognition
software. Users—particularly those associated with the vendors or trainers—frequently brag about input
at 160 words per minute with accuracy rates above 95 percent. Generally, they have trained the system
extensively and are experienced dictators. They rarely talk about the input source material or about the
total productivity time.
Most of the accuracy rate information comes from reading from written copy. The accuracy rate
produced from reading is dramatically higher than the accuracy rate produced from composing and
dictating to the system. The real test of accuracy is the rate achieved by composing and dictating
directly to the system. Reading is an excellent technique for initial learning, but the most crucial skills
that need to be learned to use speech recognition software effectively are composition and dictation
skills.
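An accuracy rate depends on how errors are counted. This paper does not say how the quoted rates were scored; one common approach, assumed here, is to compare what the speaker intended with what the system entered, count the fewest word substitutions, insertions, and deletions needed to turn one into the other, and divide by the number of intended words. The function names and sample sentences below are illustrative only.

```python
def word_errors(reference, hypothesis):
    """Fewest word substitutions, insertions, and deletions needed."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit count between the reference words processed
    # so far (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # word left out
                            curr[j - 1] + 1,       # extra word entered
                            prev[j - 1] + cost))   # wrong word, or a match
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Share of intended words that did not need an edit."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return 1 - word_errors(reference, hypothesis) / ref_len


intended = "please send the report to the sales office today"
entered = "please send a report to the sails office today"
print(word_errors(intended, entered))
print(round(word_accuracy(intended, entered) * 100, 1))
```

In this sample two of the nine intended words are wrong, so the accuracy is about 78 percent. When a speaker composes rather than reads, there is no written copy to compare against, which makes the intended text itself harder to establish.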
Reports are available from a number of studies conducted to determine the efficiency and effectiveness
of speech recognition programs. However, many of these studies used limited samples and were not
conducted under stringent testing conditions.
Results from independent laboratory tests indicate average accuracy rates of 87 to 91 percent, which are
not acceptable for general usage. The process of editing a document with a high percentage of errors is
painstaking and time consuming. Some users, however, are able to achieve 98 percent accuracy
consistently. Some laboratory tests compared keyboard input with speech recognition input. If the
typist had average keying skill, the time required to produce 100 percent accurate documents was less
than the time required to accomplish the same task with speech recognition. The time that matters in the
business world is the time that it takes to complete a document with 100 percent accuracy.
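The point about total time can be made concrete with rough arithmetic. The 160 words per minute and the 91 and 98 percent accuracy rates come from the figures quoted in this section; the 500-word document, the typist’s speed and accuracy, and the seconds needed to fix each error are assumptions chosen only for illustration, so the results show the shape of the trade-off rather than measured values.

```python
def minutes_to_finish(words, words_per_minute, accuracy, seconds_per_fix):
    """Input time plus correction time for a 100 percent accurate document."""
    input_minutes = words / words_per_minute
    errors = words * (1 - accuracy)
    return input_minutes + errors * seconds_per_fix / 60


DOCUMENT_WORDS = 500  # assumed document length

# Dictation at the quoted speed; 20 seconds per correction is an assumption.
dictation_91 = minutes_to_finish(DOCUMENT_WORDS, 160, 0.91, 20)
dictation_98 = minutes_to_finish(DOCUMENT_WORDS, 160, 0.98, 20)

# Assumed average typist: 40 words per minute, 98 percent accurate,
# 5 seconds to fix each keying error.
typing = minutes_to_finish(DOCUMENT_WORDS, 40, 0.98, 5)

print(f"dictation at 91 percent: {dictation_91:.1f} minutes")
print(f"dictation at 98 percent: {dictation_98:.1f} minutes")
print(f"typing:                  {typing:.1f} minutes")
```

With these assumptions, dictation at 91 percent accuracy takes about 18 minutes, typing about 13 minutes, and dictation at 98 percent accuracy about 6.5 minutes. Input speed alone says little; the number of errors and the time needed to fix each one decide which method finishes first.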
Some of the laboratory tests reported differences in accuracy rates based on gender in all of the speech
recognition programs tested. The average accuracy rate was higher for females than males.
Successful Applications
Some applications of speech recognition technology are more advanced than others.
Professionals in the medical field have extensive experience with dictation systems and with speech
recognition systems. Large dictionaries have been built and incorporated in speech recognition
programs for many medical specialties—radiology, pathology, orthopedics, internal medicine, and
emergency medicine. Special editions have also been developed for the legal profession.
Ergonomic Considerations
Speech recognition software offers tremendous potential for individuals who are keyboard impaired.
Many people think speech recognition is the answer to carpal tunnel problems. However, extensive,
repetitive dictation may be even more harmful to the voice than repetitive motion is to the wrists. Care
needs to be taken to avoid these harmful effects. Another key concern is the noise pollution that is likely
to occur now that most employees work in an open office environment.
The Future of Speech Recognition
The technology will continue to evolve and mature. Productivity and accuracy will increase and so will
general acceptance of the technology. Although it is unlikely that speech recognition will replace
keyboarding in the near future, it is likely that it will complement the use of the keyboard and the
mouse.
Learning to use speech recognition software effectively is a very good investment of time. Learning to
compose and dictate effectively may be an even better investment of time. Oral communication skills
are skills for the future.