TEN CASE STUDIES OF THE EFFECT OF FIELD CONDITIONS ON SPEECH RECOGNITION ERRORS

David L. Thomson
Technical Manager, Speech Processing Group
Lucent Technologies, Naperville, IL

Abstract - This paper follows ten speech recognition services as they are deployed in live trials and services in the telephone network. We focus in particular on measured recognition accuracy in the lab setting, during the first field deployment, and through several field iterations. We cite factors that contribute to loss of accuracy when systems are used by real customers and relate how these problems are identified and overcome. We observe that a common pattern of performance emerges. We show that with proper planning, careful data collection and analysis, and adequate time for technology updates, field performance can match that of the laboratory.

1 Introduction

It is generally believed that speech recognition accuracy drops when services are deployed in the field. Experience from the examples cited here verifies this belief, at least for the initial deployment of new capabilities or services. Loss of accuracy happens for an astonishing variety of reasons [1], many of which cannot be anticipated. Examples of problems that emerge in the field include unexpected user behavior, inappropriate handling of extraneous noise and speech, mismatch between training data and speech encountered in the service, and failure to take into account the range of errors that can be encountered with speech input [2]. Field issues can usually be resolved if time and resources are allocated for running pre-service trials. Field iterations frequently give rise to a "slippery-slide" shaped performance curve in which error rates increase as the system moves from the lab to the field and then taper off as problems are resolved and improvements are made. In many cases, work to address real-world problems results in new inventions that actually drive error rates below those demonstrated in the laboratory.

This paper reviews the deployment of an illustrative set of new services that use speech technology. We cover a range of applications, from simple isolated word services to complex services that use natural language input. Since these systems were designed for customers of Lucent Technologies, customer names and some application details are omitted to protect proprietary interests. The earliest services reviewed in this paper date back to the mid-'80s, and accuracy has improved substantially from some of the figures shown here. Nevertheless, the lessons learned still apply. We see that a well-designed service observation process and provisions in the deployment schedule for measuring and correcting field problems are essential for successful speech recognition system deployment.

2 Case Studies

We review ten examples of speech recognition in the field, beginning with an isolated word recognition task and ending with a natural language application. We focus on how field problems affected accuracy and how issues were resolved.

2.1 Voice Call Routing 1

In this first example, callers are prompted to say a single digit corresponding to their call type. For example, if the caller reaches a bank, the greeting may say "For accounts, say 'one.' For loans, say 'two.'" Our customer had set an error rate requirement of 9% (of files not rejected). We set an internal objective of a five percent false rejection rate. This figure was chosen to achieve a desired false acceptance rate.

Figure 1. Call routing error rates. Mismatch between training data and pre-service trials increases false rejection rates.

Figure 1 shows accuracy measured in four trial phases. This early service demonstrates an effect that we have seen repeated many times since: error rates are often good in an initial trial, but rejection thresholds are easily thrown off by a mismatch between training data and speech encountered in the real service. A more serious complication was that the customer expected the 9% error rate target to be met for each word in the six-digit vocabulary in each of 17 dialect regions in the U.K., for a total of 108 test conditions. Although the overall error rate was 1.2% in Phase 2, the word "two" in Northern Ireland was recognized as "three" about 30% of the time. We learned from field recordings that in that region, "two" is pronounced "twoee," making it sound somewhat like "three." Increasing the proportion of the problem word in our training data reduced its error rate to 6.3%.
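The false rejection and false acceptance targets in this example trade off against each other through the recognizer's rejection threshold, which is why a mismatch between training data and field speech shifts both figures at once. A minimal sketch of that trade-off, using hypothetical per-utterance confidence scores rather than data from this trial:

    # Illustrative only: hypothetical confidence scores, not data from the trial.
    def rejection_rates(valid_scores, invalid_scores, threshold):
        """Return (false rejection, false acceptance) rates at one threshold."""
        false_rejection = sum(s < threshold for s in valid_scores) / len(valid_scores)
        false_acceptance = sum(s >= threshold for s in invalid_scores) / len(invalid_scores)
        return false_rejection, false_acceptance

    # Print the operating points for a few thresholds; with this made-up data,
    # the 5% false rejection objective corresponds to the highest threshold shown.
    valid = [0.91, 0.84, 0.77, 0.95, 0.62, 0.88, 0.70, 0.93, 0.81, 0.97,
             0.86, 0.74, 0.90, 0.83, 0.79, 0.94, 0.68, 0.89, 0.92, 0.76]
    invalid = [0.40, 0.55, 0.30, 0.66, 0.48, 0.58, 0.25, 0.61, 0.44, 0.52]
    for t in (0.50, 0.60, 0.65):
        fr, fa = rejection_rates(valid, invalid, t)
        print(f"threshold {t:.2f}: false rejection {fr:.0%}, false acceptance {fa:.0%}")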
2.2 Voice Call Routing 2

In a second call routing service, lab simulations were matched by field results in Phase 3, as shown in Figure 2. However, the nature of the service made the false acceptance rate of 4.5%, an otherwise good figure, unacceptable. Analysis of recordings revealed that the recognizer was triggering on sounds such as breath noise a small percentage of the time. We built a database of the offending sounds and created a "breath" model, driving the false acceptance rate to 1% [3].

Figure 2. Call routing error rates. False acceptance rates identified as critical in Phase 3 are addressed in Phase 4.

2.3 Voice Activated Custom Calling

In an application designed to control features such as call forwarding, the false rejection rate was about 20% in Phase 1 (see Figure 3). The recognition models had been built from speech recorded over digital lines, but the trial was run in a rural Idaho community with very long local loops, creating a mismatch between the models and the field environment. New models made from recordings from this first phase fixed the problem. Other improvements yielded final results superior to lab results [4].

Figure 3. Custom calling error rates. The "slippery-slide" shape of error curves is evident as the recognizer is tuned to a new environment.

2.4 Voice Name Dialing

We made a strategic error in an early voice name dialing service in not building the capability to make recordings of the system in use. This mistake forced us to rely on anecdotal accounts of service performance and made accuracy tuning difficult. In a focus group, users reported accuracy in the range of 30% to 75%, yet when we evaluated the system against a test database (without tuning), we measured 89%. We eventually resolved the problems with a great deal of effort, and have since developed a better system, but we learned an important lesson: never deploy speech recognition without a good service observation capability.
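In practice, service observation is mostly a matter of logging enough per-call information that accuracy can later be measured from the recordings rather than from anecdote. A minimal sketch of that kind of offline scoring follows; the record layout and field names are illustrative assumptions, not the format used in these services.

    from dataclasses import dataclass

    @dataclass
    class CallRecord:
        audio_file: str      # path to the stored field recording
        transcription: str   # human transcription, "" if the input was invalid
        recognized: str      # what the recognizer returned
        accepted: bool       # whether the result passed the rejection test

    def score(records):
        """Compute error and rejection rates from transcribed field recordings."""
        valid = [r for r in records if r.transcription]
        invalid = [r for r in records if not r.transcription]
        return {
            "false rejection": sum(not r.accepted for r in valid) / max(len(valid), 1),
            "substitution": sum(r.accepted and r.recognized != r.transcription
                                for r in valid) / max(len(valid), 1),
            "false acceptance": sum(r.accepted for r in invalid) / max(len(invalid), 1),
        }

    # Example: a correct digit, a rejected valid digit, and an accepted breath noise.
    calls = [CallRecord("c1.wav", "two", "two", True),
             CallRecord("c2.wav", "one", "one", False),
             CallRecord("c3.wav", "", "three", True)]
    print(score(calls))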
2.5 Automated Attendant 1

An application where callers were prompted to say one of five isolated words was run first as a trial, then several years later as a real service. Results are shown in Figures 4 and 5. In Phase 2A we found that 20% of the callers spoke the correct keyword but embedded it in a phrase. The first wordspotting algorithm was subsequently developed and reduced the error rate to that shown in Phase 3A [5]. The importance of service observation is highlighted in Figure 5, where we were not allowed to make recordings during Phase 1B and thus saw no improvement in Phase 2B. Recordings from Phase 2B, combined with a series of technology improvements, led to improved results in the final service.

Figure 4. First set of the automated attendant trials. A wordspotting algorithm is used to reduce error rates for embedded speech.

Figure 5. Further automated attendant trials and deployment. Failure to make recordings holds error rates constant in Phase 2B.

2.6 Automated Attendant 2

Another automated attendant service, with a 24-word greeting vocabulary, illustrates the danger in allowing an external organization to measure accuracy without a procedure for verifying the results. Our lab simulations easily met requirements, but an independent evaluation by the customer showed our system failing rejection metrics (see Figure 6). When we were allowed to review the study, we found it had been hastily done. Results from our recognizer had been compared to inaccurate transcriptions of field recordings. Once these and other errors were corrected, the rates shown in the last phase were obtained.

Figure 6. Error rates for a 24-word automated attendant. Data mislabeling by the customer inflates error measurements.

One issue dealt with in the field version that often plagues new speech recognition systems is customer restarts, meaning the caller provides invalid input, pauses, and then speaks a valid phrase. The invalid input can be a word, breath noise, or background noise. The expectation is that the recognizer will reject the invalid input and accept the valid phrase, but this is not as simple as it appears. First, accurate out-of-vocabulary rejection is difficult and can result in rejection of legitimate input. Second, the service application software must be intelligent enough to distinguish spurious noises from a serious attempt to speak a correct phrase. If the input is spurious, the recognizer should continue to listen. If it is a failed attempt, the system should reprompt the caller. These distinctions are made by a combination of timers, energy detectors, and recognition pass/fail criteria, designed around the input the system is likely to encounter.
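One way to picture how those pieces combine is the small decision routine sketched below. It is an illustration of the idea rather than the deployed logic; the timer value, the event representation, and the speech/noise classification are all assumptions.

    MAX_LISTEN_SECONDS = 6.0   # assumed overall listening window

    def handle_turn(events):
        """Decide what to do with one caller turn.

        events is a time-ordered list of (time_s, kind, passed) tuples from the
        front end: kind is "noise" for short bursts flagged by the energy
        detector and "speech" for segments long enough to recognize; passed is
        the recognizer's pass/fail decision for speech segments.
        """
        for time_s, kind, passed in events:
            if time_s > MAX_LISTEN_SECONDS:
                break                  # the time window expired
            if kind == "noise":
                continue               # spurious input: keep listening
            if passed:
                return "accept"        # a valid phrase was recognized
            return "reprompt"          # a serious attempt that failed rejection
        return "reprompt"              # nothing usable before the timeout

    # A breath noise followed by a valid phrase is accepted; a failed attempt
    # at a real phrase triggers a reprompt.
    print(handle_turn([(0.8, "noise", False), (2.4, "speech", True)]))   # accept
    print(handle_turn([(1.5, "speech", False)]))                         # reprompt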
2.7 Wireless Digit Dialing

In our first experiment with voice dialing of telephone numbers from mobile phones, we did not have wireless recognition models and were forced to use landline models in Phase 1 while we waited for the wireless data collection to finish (see Figure 7; replacement errors and rejection errors are added together). Data collection improved the results in Phase 2. We were surprised to learn that, at this point, users were delighted with the service, despite accuracy that was still poor from our point of view. Even when people had to repeat the number two or three times to get the entire 7-digit string correct, they felt that the convenience of not having to dial while driving was worth it. Accuracy was further improved in Phase 3 by building models using recordings from actual field service instead of from the data collection. In addition, we discovered that some users were using mobile speakerphones (against our strong advice), so we folded speakerphone data into Phase 3. Optimization and hardware improvements gave the results in Phase 4, though the accuracy falls far short of that of our latest (1997) wireless recognizers.

Figure 7. Wireless digit dialing accuracy. Per-digit error is high with wired models, but drops as data is used to build new models and algorithms.

2.8 ATM Speaker Verification

One candidate for speaker verification technology is confirming the identity of ATM (Automated Teller Machine) users. In a trial conducted in cooperation with a large bank, we had seen about a 5% EER (Equal Error Rate) in the lab for 4-digit, randomly prompted tasks. However, when the bank began to measure performance, they reported that the system was failing 40% of the time. Fortunately, we had made recordings and found that 7% of these failures were "imposters" (people experimenting with the service to see if they could break into someone else's account) and 28% were people failing to speak within the allocated time window. Once we eliminated these data points, the measured error rate dropped to 11%, as shown in Figure 8, Phase 1. Tuning and other improvements cut that figure in half. It is interesting to note that another test database collected at the same time with the same task yielded dramatically different results, with a final EER of only 2.4% [6]. This shows how results can be highly dependent on the test data.

Figure 8. ATM speaker verification errors measured by the bank included user errors.
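The verification figures above are quoted as equal error rates, the operating point at which false acceptances of imposters and false rejections of true users are equally frequent. A minimal sketch of how an EER can be estimated from two score lists is shown below; the scores are hypothetical, and higher values are taken to mean a closer match to the claimed speaker.

    def equal_error_rate(genuine_scores, imposter_scores):
        """Estimate the EER by sweeping thresholds over the observed scores."""
        best_gap, eer = None, None
        for t in sorted(set(genuine_scores) | set(imposter_scores)):
            far = sum(s >= t for s in imposter_scores) / len(imposter_scores)
            frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
            if best_gap is None or abs(far - frr) < best_gap:
                best_gap, eer = abs(far - frr), (far + frr) / 2.0
        return eer

    # Hypothetical scores: genuine attempts mostly score high, imposters low.
    genuine = [0.92, 0.85, 0.78, 0.95, 0.60, 0.88, 0.83, 0.91, 0.74, 0.87]
    imposters = [0.35, 0.52, 0.70, 0.41, 0.30, 0.63, 0.28, 0.57, 0.45, 0.49]
    print(f"estimated EER: {equal_error_rate(genuine, imposters):.0%}")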
2.9 Voice Calling Card

One common reason lab results do not match field results is that lab numbers often ignore rejection errors. Accuracy, for example, may be quoted to be in the mid-90s, yet in a real service, some of the valid input is rejected even if it would otherwise have been correctly recognized. This is an unavoidable result of the recognizer's attempt to reject the invalid input that is always present for some fraction of users in a real service. In a previous large-scale connected-digit service, we had measured a string error rate of 5.9% on 10-digit strings, but based new requirements on the assumption that an additional 5% of the valid strings would be rejected. To leave a margin for testing error and environmental uncertainties, requirements were set at 16.2%.

This figure sounds like an easy target, but our first measurement of field accuracy showed a dismal 44.2% error rate, based on a user survey of people calling the service and keeping score on tally sheets. We have learned that manual reporting of accuracy is nearly always wrong by a wide margin. The only way to reasonably evaluate performance is to analyze recordings. Figure 9 shows that the actual error rate in Phase 1 was less than half of the tally sheet result. A series of algorithmic improvements brought the final error rate to under 10%. Most of the remaining errors were from phones with high levels of distortion or noise, or from speakers who were otherwise difficult to understand.

Figure 9. Calling card digit string accuracy. Tally sheets are an unreliable method for measuring performance.

2.10 Natural Language Movie Locator

Although most of the profitable applications currently in use receive simple input from customers, our newest systems understand thousands of words and complete sentences spoken by customers. The objective of these natural language systems is to eliminate prompting menus and allow free-form input from callers. As one test case for this technology, we ran a series of trials on a movie locator system that would allow a user to ask questions such as "Where is Men in Black playing near Wheaton?" or "Where is the Ogden 6 Theater?" In our initial attempt (Phase 1 in Figure 10), the rejection sensitivity was set too high. By lowering this parameter and by adding new grammar options based on questions people were actually asking, we cut the error rate almost in half. Further enhancements through Phase 5 cut the error rate in half again. One improvement was to include multiple pronunciations for some vocabulary words. For example, the word "when" was represented by two pronunciations, "when" and "whin," based on the way people actually said the word. This allowed the recognizer to handle variation in caller pronunciation. Use of a completely new recognizer (after some initial tuning in Phase 6) cut the error rate in half again.

Figure 10. Error rates for a natural language movie locator service. Switching to a new recognizer (Phases 6 and 7) and enhancing the grammar yielded high accuracy.

3 Conclusions

As we study successful deployments of speech recognition, we see a familiar pattern emerge in which accuracy initially degrades, then improves in the process of conducting field experiments. Obtaining field results comparable to lab results requires a willingness to conduct several field iterations and a good service observation system. We see that field performance can be brought to a level that is acceptable to customers and that speech recognition is a viable option for real services in the telephone network.

Acknowledgments

The author wishes to thank John Jacob, Anand Setlur, Rafid Sukkar, and Jack Wisowaty for providing some of the previously unpublished data for this review.

References

[1] D. L. Thomson, "Looking for Trouble: Planning for the Unexpected in Speech Recognition Services," IEC Annual Review of Communications, vol. 50, pp. 1089-1093, IEC, 1997.
[2] B. H. Juang, R. J. Perdue, Jr., and D. L. Thomson, "Deployable Automatic Speech Recognition Systems: Advances and Challenges," AT&T Technical Journal, vol. 74, no. 2, pp. 45-5, March/April 1995.
[3] D. J. Krasinski and R. A. Sukkar, "Automatic Speech Recognition for Network Call Routing," Proc. IVTTA 94 (Second IEEE Workshop on Interactive Voice Technology for Telecommunications Applications), pp. 157-160, Sept. 1994.
[4] K. V. Kinder, S. Fox, and G. Batcha, "Accessing Telephone Services Using Speech Recognition: Results From Two Field Trials," Proc. AVIOS (Annual Voice I/O Systems), pp. 83-89, Sept. 1993.
[5] R. W. Bossemeyer, J. G. Wilpon, C. H. Lee, and L. R. Rabiner, "Automatic speech recognition of small vocabularies within the context of unconstrained input," (abstract) J. Acoust. Soc. Am., Suppl. 1, 84, 1988.
[6] T. E. Jacobs and A. K. Setlur, "A Field Study of Performance Improvements in HMM-Based Speaker Verification," Proc. IVTTA (IEEE Workshop on Interactive Voice Tech. for Telecom. Appl.), pp. 121-124, 1994.