Speech Recognition, Voice Recognition, and Natural Language Processing

All of these technologies are connected: each takes the human voice and converts it into words, commands, or a variety of interactive applications. Voice recognition goes one step further, using the voice to verify identity as well as to understand basic commands. These technologies will play a greater role in the future and even threaten to make the keyboard obsolete.

The articles that follow focus on the terms that define the applications, how the technology works, who the major players will be in the future, and how these players envision many of the applications evolving. They are meant to present a comprehensive view of the technology, which we will then bring into focus during class.

Some study questions to think about:

1. What are some applications of speech and voice recognition technology in the future?
2. Do you view this as a disruptive technology? If so, what is it disrupting?
3. Where in your day-to-day life could you see yourself using these technologies?

Don’t worry about the details of how it works; we will cover these in class. Spend your time thinking about where the technology is now, where it is going, and how it can affect your future both personally and professionally.

WEB LINKED ARTICLES

The future of talking computers
Last modified: October 13, 2003, 1:00 PM PDT
http://news.com.com/2008-1011_3-5090381.html?tag=st_rn

Microsoft handhelds find their voice
Last modified: November 3, 2003, 9:03 AM PST
http://news.com.com/2100-1046-5101193.html

Voice Authentication Trips Up the Experts
Published: November 13, 2003
http://www.nytimes.com/2003/11/13/technology/circuits/13next.html?adxnnl=1&adxnnlx=1069285421-l1EJE+7wZPy0+t79yv0xjw

Dragon: Worth Talking To
http://www.businessweek.com/technology/content/jan2002/tc20020130_6540.htm

Summary: What is Speech Recognition?
Speech recognition (SR) is an emerging technology that will impact the convergence of the telephony, television and computing industries. SR technology has been available for many years. However, it has not been practical due to the high cost of applications and computing resources and the lack of common standards to integrate SR technology with software applications. The business community has not yet fully embraced SR: voice-to-text (dictation) applications generated only $48 million of revenue in the US during 1996. According to William Meisel of the "Speech Recognition Update" newsletter, business has not yet moved to SR technology because of:

- quality: the ability to understand spoken words and to discern between like-sounding words (to, too, two)
- the cost of applications and computing resources
- the lack of integration between SR software, operating systems, and applications.

However, these concerns are being addressed. The SR area should experience significant growth and have a substantial impact on business and society over the next 10 to 20 years, particularly in telephony (call center management, voice mail, PDAs) and voice-to-text (VTT) applications.

Speech Recognition Technology and Applications

Speech recognition is an enabling technology that may radically change the interface between humans and computers (and other devices having computational abilities). The current interface with these devices is the keyboard/keypad and mouse. However, SR is a complex technological challenge. In order to achieve SR, a computer must perform the following functions:

- recognize the sentence that is being spoken
- break the sentence down into individual words
- interpret the meaning of the words
- act on the interpretation in an appropriate manner (for example, deliver information back to the user).

It takes the average human ten-plus years to develop the rudimentary elements of this task! SR requires a software application "engine" with logic built in to decipher and act on the spoken word.
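As a rough illustration, the four functions listed above can be sketched as a toy pipeline. This is only a minimal sketch under stated assumptions, not how a real engine works: every function name is hypothetical, the "audio" is simply a text transcript standing in for a recorded signal, and the two supported commands are invented for the example.

```python
# Toy sketch of the four SR functions. All names are hypothetical, and the
# "audio" argument is a plain text transcript standing in for real sound.

def recognize_sentence(audio):
    # Function 1: recognize the sentence being spoken.
    # Placeholder only: a real engine would decode an audio signal here.
    return audio.strip().lower()


def split_into_words(sentence):
    # Function 2: break the sentence down into individual words.
    return sentence.split()


def interpret(words):
    # Function 3: interpret the meaning of the words.
    # Here "meaning" is just an (intent, argument) pair from two made-up rules.
    if len(words) >= 2 and words[0] == "call":
        return ("call", " ".join(words[1:]))
    if len(words) >= 2 and words[0] == "open":
        return ("open", " ".join(words[1:]))
    return ("unknown", " ".join(words))


def act(intent, argument):
    # Function 4: act on the interpretation and report back to the user.
    if intent == "call":
        return f"Dialing {argument}"
    if intent == "open":
        return f"Opening {argument}"
    return f"Sorry, I did not understand '{argument}'"


def handle_utterance(audio):
    # Chain the four functions in order.
    sentence = recognize_sentence(audio)
    words = split_into_words(sentence)
    intent, argument = interpret(words)
    return act(intent, argument)


print(handle_utterance("Call home"))
```

The first step is where nearly all of the real difficulty lies; the sketch skips it entirely so that the overall flow from sound to action stays visible.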
Numerous engines exist, with a range of capabilities. There are three main development weaknesses with most available SR engines:

1. Inability to decipher conversational speech. Most engines are capable of interpreting words that are spoken clearly with a specific cadence in an environment free of significant background interference/noise. This weakness requires users to develop SR "computing skills": the user needs to learn the language of the specific SR engine. Work is being done at Stanford University's Applied Speech Technology Laboratory to develop Conversational Speech Recognition (CSR). Conversational means that the engine is able to interpret a user's skills and ask appropriate questions to ensure that the correct commands are being executed. In essence, CSR adapts to the user instead of making the user adapt to it.

2. Lack of standards for quick and economical application development. The Speech Recognition Programming Interface Committee (SRAPI) is working with a consortium of technology firms to develop standards that will bring the capabilities of SR into mainstream acceptance and use.

3. Limited ability to interpret the context of the speaker. This is a critical limitation of current technology: it is difficult to program an engine to recognize and interpret speaker context. Victor Zue at MIT is working to improve this situation by developing engines that can operate within the context of specified content domains. An article describing Zue's work and the limitations that context imposes on SR applications can be found in the Economist.

The implications of this technology will be far ranging once the cost of computing resources becomes reasonable, standards are developed, and competent SR engines are developed. For the most part, the conditions mentioned above have already been met. Costs have been reduced: SR products can now be run on existing Pentium-level PCs.
Standards have developed but still need improvement. Numerous engines have been developed with a broad range of capability. These advancements have resulted in products like:

- Interactive Voice Response systems (IVRs), including call center activities and voice mail systems. Cyber Voice Incorporated is one of many firms that are using SR technology to improve customer call center operations.
- Voice-to-text (VTT) applications that take dictation of words and numbers, automatically insert them into word processors and spreadsheets, and are used to perform common computing commands. IBM and other software providers offer applications with similar features.

A list of other commercially available SR applications shows the potential of SR technology.

Benefits of SR Applications

- VTT increases the efficiency of workers who perform extensive typing or data entry (both numbers and words can be dictated). This could be particularly beneficial in legal, medical and insurance environments where large amounts of dictation and transcription occur.
- VTT applications can help prevent repetitive strain disorders caused by keyboards, because they reduce the need to type or use a mouse. However, anecdotal incidents of voice disorders caused by the use of SR are already surfacing in the media.
- VTT can be used to assist individuals with disabilities in performing a broader range of jobs in the work environment.
- Interactive voice response systems have many benefits in the management of call centers: they reduce staffing costs, screen phone calls, and route calls to the appropriate service provider (computer or human).

Potential Future Applications

Speech driven web browsers: The user interface changes from keyboard/mouse to speech. This allows greater access to the net because a telephone can serve as the interface. Users could theoretically get voice mail and email using this functionality.
Email could be read to them via voice-synthesizing software (text-to-voice or TTV), which is already commercially viable.

SR servers on the internet: A customer may be exploring the internet and see an advertisement that interests her. A simple click on a button gives her an internet connection to an SR application on the company's server, which determines her need and funnels her call to the appropriate service provider (computer or human)!

Speech driven desktop telephony: A telephone user simply says "call home" or "conference in Jeff" and the computer-telephone integrated device executes the command. (Wildfire Home Page)

Personal Digital Assistants (PDAs) with SR capability: Devices would not have SR capability locally (due to limited computing power) but could be used as an interface (via wireless telephone service) between the user and a remote server with SR capability. A talking, handheld, "thin client" PDA!

SR appliances: A washing machine user would simply tell the appliance "start a cold cycle". The SR computing power would reside in a central (home) PC, or on the internet, which is connected to the appliance and executes the command for it and other home appliances. The central PC or internet resource is required because installing the computing power and software on each individual appliance would be uneconomical.

Related Links

The Applied Speech Laboratory
MIT Lincoln Laboratory Speech Systems Technology Group
Commercial Speech Recognition
References and Books on Speech Recognition at CSLI

How Does Speech Recognition Work?
By James Matthews

How does a computer convert spoken speech into data that it can then manipulate or execute? Well, from a general perspective, what has to be done? Initially, when we speak, a microphone converts the analog signal of our voice into digital chunks of data that the computer must analyze. It is from this data that the computer must extract enough information to confidently guess the word being spoken.
This is no small task! In fact, in the early 1990s, the best recognizers were yielding a 15% error rate on a relatively small 20,000-word dictation task. Now though, that error percentage has dropped to as low as 1-2%, although this can vary greatly between speakers. So, how is it done?

Step 1: Extract Phonemes

Phonemes are best described as linguistic units. They are the sounds that group together to form our words, although quite how a phoneme converts into sound depends on many factors, including the surrounding phonemes and the speaker's accent and age. Here are a few examples:

aa   father
ae   cat
ah   cut
ao   dog
aw   foul
ng   sing
t    talk
th   thin
uh   book
uw   too
zh   pleasure

English uses about 40 phonemes to convey the 500,000 or so words it contains, making them a relatively good data item for speech engines to work with.

Extracting Phonemes

Phonemes are often extracted by running the waveform through a Fourier Transform. This allows the waveform to be analyzed in the frequency domain. Well, what does this mean? It is probably easier to understand this principle by looking at a spectrograph. A spectrograph is a 3D plot of a waveform's frequency and amplitude versus time. In many cases though, the amplitude of each frequency is expressed as a colour (either greyscale or a gradient colour). Below is the spectrograph of me saying "Generation5":

[Figure: spectrograph of "Generation5"]

As a comparison, here is another spectrograph of the "ss" bit of "assure" (this is a phoneme):

[Figure: spectrograph of the "ss" in "assure"]

Using this, can you see where the "sh" of "Generation5" comes in the first spectrograph? Note that the timescales are slightly different on the two spectrographs, so they look a little different. As you can see, it is relatively easy to match up the amplitudes and frequencies of a template phoneme with the corresponding phoneme in a word. For computers, this task is obviously more complicated but definitely achievable.

Step 2: Markov Models

Now that the computer has generated a list of phonemes, what happens next?
Obviously, these phonemes have to be converted into words and perhaps even the words into sentences. How this occurs can be very complicated indeed, especially for systems designed for speaker-independent, continuous dictation. However, the most common method is to use a Hidden Markov Model (HMM). The theory behind HMMs is complicated, but a brief look at simple Markov Models will help you gain an understanding of how they work.

Basically, think of a Markov Model (in a speech recognition context) as a chain of phonemes that represents a word. The chain can branch, and where it does, each branch is weighted by its statistical likelihood. For example:

[Figure: Markov Model for the word "tomato"]

Note that this Markov Model represents both the American English and the (real) English methods of saying the word "tomato". In this case, the model is slightly biased towards the English pronunciation. This idea can be extended up to the level of sentences, and can greatly improve recognition. For example:

Recognize speech
Wreck a nice beach

These two phrases sound surprisingly similar, yet have wildly different meanings. A program using a Markov Model at the sentence level might be able to ascertain which of these two phrases the speaker was actually using through statistical analysis of the phrase that preceded it. For more information on Markov Models, see the Generation5 introductory essay.

Conclusion

This essay hopefully gave you a decent overview of how speech recognition works. The stress is on the word overview: speech technologies are quickly moving forward, and the algorithms and methods described in this essay are being greatly optimized and improved. With the advent of intelligent, filtering microphones and near-perfect speech recognition, we will hopefully see a new era of human-computer interaction evolve.

Technology of SR: http://www.emory.edu/BUSINESS/et/speech/technology.htm

The complicated technologies supporting Speech Recognition systems vary as much as the voice itself.
However, the underlying technology of SR is basically the same for all the major applications today. In the simplest sense, speech is input into the computer and is then parsed and/or identified by the Speech Recognition program. Next, the processor runs a series of algorithms to determine what is believed to have been said (based on other technologies to be explored next) and responds to the audible message, either as a command or as speech-to-text input.

The ultimate objective for developing SR technologies is to create a system through which humans can speak to a machine in the same way they would converse with another human being. Essentially, we will speak in a natural language to the humanized computer system, without regard to perfect syntax or grammar. "When a speech recognition system is combined with a natural language processing system, the result is an overall system that not only recognizes voice input but also understands it." (Turban)

Natural Language Processing (NLP) has two basic methods for interpreting voice input:

1) Keywording: The speech is recorded and the computer generates results based on important words or phrases. For instance, this application works well for performing tasks on an operating system: "Open file", "select all", etc. Keywording is also used in call centers (for example, you say the party’s name or extension instead of pressing keys on the number pad).

2) Syntactic and Semantic Analysis: This process is much more complex than Keywording. As the speaker inputs audible data, the SR program parses the sound and computes what the system believes the user said. This technique requires an extensive set of algorithms, rules, and definitions. For instance, when the word "two" is spoken into the system, the program can predict that "2" is intended (instead of "too" or "to"). The computer may determine the appropriate meaning of this homonym by analyzing the syntax, semantics, and sentence structure.
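To make the "two / too / to" example concrete, the following minimal sketch picks among the three spellings by looking at the neighbouring words. It is a deliberately simplified stand-in for real syntactic and semantic analysis: the word-pair counts are invented for illustration, and the names BIGRAM_COUNTS, score, and disambiguate are hypothetical rather than taken from any actual product.

```python
# Made-up counts of how often one word follows another (an assumption for
# this example; a real system would learn such figures from large text samples).
BIGRAM_COUNTS = {
    ("want", "to"): 10,
    ("to", "open"): 9,
    ("open", "two"): 6,
    ("two", "files"): 9,
    ("me", "too"): 7,
}

# The three spellings that sound alike.
HOMOPHONES = {"to", "too", "two"}


def score(prev_word, candidate, next_word):
    # Support for a candidate = how well it fits the word before and after it.
    return (BIGRAM_COUNTS.get((prev_word, candidate), 0)
            + BIGRAM_COUNTS.get((candidate, next_word), 0))


def disambiguate(words):
    # Replace each sound-alike word with the spelling its neighbours support
    # best. The previous word is taken from the already corrected output; the
    # next word is taken from the raw input. Ties fall back to alphabetical order.
    result = []
    for i, word in enumerate(words):
        if word in HOMOPHONES:
            prev_word = result[-1] if result else "<start>"
            next_word = words[i + 1] if i + 1 < len(words) else "<end>"
            word = max(sorted(HOMOPHONES),
                       key=lambda c: score(prev_word, c, next_word))
        result.append(word)
    return result


heard = "i want too open to files".split()
print(" ".join(disambiguate(heard)))
```

Counting neighbouring words captures only a sliver of what full syntactic and semantic analysis does, but it shows why surrounding context, rather than the sound alone, decides which spelling is intended.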
This method is best applied to word processing and data entry.

Another important technology associated with SR is the ability of the program to understand fluid speech versus unnatural speech with pauses between each word. This ability marks the difference between Continuous Speech systems and Discrete Speech systems. While Discrete Speech systems are not conducive to natural human speech, they are highly accurate. On the other hand, as expected, the Continuous Speech model, which is closer to the way a human naturally talks, has a lower accuracy rate.

Several companies have developed and distributed "Speech Engines." These "engines" are essentially databanks of all possible words, phrases, syllables, phonemes, etc. through which the SR programs search to find a reasonable result. Each developer's speech engine operates on a different principle. For instance, the Microsoft Speech Recognition Engines use either an "acoustic model" or a "dictation language model." Other companies have their own specifications.

Speech Recognition versus Voice Recognition
http://www.emory.edu/BUSINESS/et/speech/srvr.htm

Although Speech Recognition and Voice Recognition are often mistakenly referred to as the same technology, the two definitely have different underlying technologies and applications. Speech Recognition (SR) is the technology used in applications to interpret spoken words into usable data such as computer commands or word processing. Voice Recognition (VR) is a security-based technology intended to identify and grant rights to a user based on the properties of his or her voice.

Current Commercial Applications
http://www.emory.edu/BUSINESS/et/speech/players.htm

The first commercial applications of computer-aided voice recognition came in the medical and legal fields. Physicians and attorneys used to dictate notes on a case to an answering service and a secretary would type the report.
As the power of computer hardware and software improved, the speech recognition capabilities of the computer became sufficient to transcribe these dictations. Rather than having someone re-type the entire report, a human was merely needed to proofread the document after the computer constructed a rough draft. Soon the necessity for a human proofreader will vanish as the technology becomes even more powerful. The need for an accurate and efficient method of transcription provided the impetus for today’s commercial voice recognition software.

There are three major players in the end-user commercial application of speech recognition: IBM, Lernout & Hauspie, and Dragon Systems. These three companies provide software packages that convert audible words into digital data that computer applications can transform into usable data. IBM's ViaVoice, L&H's Voice Xpress, and Dragon Systems' NaturallySpeaking are very similar products that are comparable in price, ease of use, and features. The deluxe version of each program costs about $150 and has a vocabulary of over 200,000 words. They will convert voice data into usable data for most popular software applications and have customized interfaces for the Microsoft suite of applications. These programs are designed to recognize and correctly interpret dates, currency and numbers. The user can control the operations of the computer (such as opening and closing files and browsing the Web) through voice commands and macros. The software will also read text and numbers to the user in a human voice. All of these voice recognition programs require an intense training session (from 15 minutes to an hour) to learn the specific patterns of an individual's voice. As computer processor speeds have improved, so have the accuracy and speed of these voice recognition software applications.
VoiceXML

On March 2, 1999, twenty leading speech, Internet and communications technology companies announced the formation of the Voice eXtensible Markup Language Forum to develop a standard in voice recognition technology. The VXML Forum "aims to drive the market for voice- and phone-enabled Internet access by promoting a standard specification for VXML, a computer language used to create Web content and services that can be accessed by phone." [1] Once a standard is established in the computer community, there will be increased adoption of voice recognition technology by third-party software developers. Even simple programs will be able to incorporate voice recognition technology without a large investment in development time and skill.

Natural Language Speech Assistant by Motorola and Unisys

The Natural Language Speech Assistant (NLSA) is a developer’s toolkit for building software that enables customers to access the data they need using their own everyday language (or natural language), rather than restricting the responses to keypad entries or single-word answers. The NLSA equips developers with the tools necessary for writing speech-enabled applications. This eliminates the need to learn the details of programming speech recognizers. In addition, it protects programmers' development investments when they migrate to different speech recognizers. Furthermore, NLSA will hopefully enhance current Internet Voice Recognition applications as well as help develop new and more sophisticated applications by capitalizing on the speech technology available today.

Limitations and Potentials of SR: http://www.emory.edu/BUSINESS/et/speech/limitations.htm

In the past, the major constraint to developing the perfect Speech Recognition system was the limited processing power of the computer's microprocessor.
Once this obstacle was overcome with the development of the microchip, the true limitation of SR technology became visible: the difficulty of developing algorithms sophisticated enough to understand, interpret, and respond to voice commands nearly perfectly. The answers to this problem still elude the most successful research institutions. For instance, some systems can understand input from a variety of users but with a limited vocabulary bank. Conversely, other systems recognize over 200,000 words but from only a very limited number of users. There does not exist a program that can comprehend extensive vocabularies from various speakers.

The commercial programs available today require a "training session" with each user, which may last over an hour. During this time, not only does the user have to learn how to speak to the machine, but the computer also needs to become accustomed to the user's voice. This may be a constraint on productivity because of the lost hours, but it also presents another problem: systems will need to learn to understand multiple users in a short time (or instantly). For instance, when we go to the McDonald's drive-thru window and order a burger with ketchup, we will expect the computer system to recognize our verbal input immediately. It would not be fast food if we had to train the Speech Recognition program for half an hour!

Another limitation of current SR tools is that nearly unlimited variables make up the sound of a voice. For example, when we answer a phone call just as we wake from sleep, our voice sounds different than after we have cheered all night at a basketball game. Additionally, background noise limits the effectiveness of SR technology. It is relatively easy for the computer to filter background noise when we are speaking in a quiet office, but if we were to say the same phrase on a busy street the SR system would be confused.
Even though the current SR systems have limitations, significant progress has been made toward a perfectly reliable SR program. Once these frustrating hindrances are overcome, the potential for SR technology is enormous. The traditional methods of inputting data into a computer, such as the mouse and keyboard, will become obsolete. Furthermore, the interaction between the user and the computer will commonly be speak/listen/speak… as opposed to mechanical-input/read/mechanical-input… For a further look into the future and potential of Speech Recognition technology, see "Future of SR".

The current commercial SR products do not have the capability to be used on a wide scale. Depending on the application of this technology, it may or may not be an appropriate time to adopt these SR systems. For instance, SR technology may be effectively implemented to reduce costs in call centers. However, SR technology is perhaps not yet at a level suitable to increase the productivity of office tasks. At the rate SR software applications are developing, it will soon be beneficial to employ Speech Recognition technology in everyday functions.

Future of SR: http://www.emory.edu/BUSINESS/et/speech/future.htm

We could all use K.I.T.T. from the famed 1980s television hit “Knight Rider” at some time. Whether we want to take the quickest route to the office or ask who the person in the car next to us is, technologies similar to those demonstrated in this science fiction television series will soon be standard features in automobiles. In fact, Clarion has already developed a first-generation AutoPC that can respond to voice commands. This product uses Microsoft’s Windows CE operating system to perform “hands-free” functions such as voice-activated calling through a mobile phone, assistance with directions, and additional information on weather, news, and stocks.
As Speech Recognition technology improves in terms of accuracy, vocabulary, and its ability to understand natural language, we will see the concept of interactive machines in every arena. From assembly line mechanical tools to intelligent microwave ovens to "writing" a check, we will have the power to use our voice to instruct the electronic devices we encounter every day.

The progress of Speech Recognition technologies, in the near future, may be hindered by the lack of an effective and legitimate standard code. Efforts are being made by AT&T, Lucent, Motorola, and seventeen other leading institutions to develop the Voice eXtensible Markup Language (VXML) standard. However, Microsoft and Unisys are also collaborating to popularize the Speech Application Programming Interface (SAPI), which is not compatible with the VXML standard. Speech and Voice Recognition technologies will probably not flourish until these standards are approved by the World Wide Web Consortium (W3C), just as the World Wide Web did not grow until the HyperText Markup Language (HTML) standards had been adopted.

It is apparent that humans are trying to create a computing environment in which the computer learns from the user instead of one where the user must learn how to use the computer. Speech Recognition technology is the next obvious step in an attempt to integrate computing into a "natural" way of life. This effective means of communication, even when perfected, will still present limitations as to how humans can express themselves. The bottleneck of the future will be the physical constraint of not being able to speak all of one's thoughts in a coherent and sensible fashion. Instead, scary as it may be, computers may have systems in place that can receive and interpret neurological data.