TEN CASE STUDIES OF THE EFFECT OF FIELD CONDITIONS ON SPEECH RECOGNITION ERRORS
David L. Thomson
Technical Manager
Speech Processing Group
Lucent Technologies
Naperville, IL
Abstract - This paper follows ten speech recognition services as they are deployed
in live trials and services in the telephone network. We focus in particular on
measured recognition accuracy in the lab setting, during the first field deployment,
and through several field iterations. We cite factors that contribute to loss of accuracy
when systems are used by real customers and relate how these problems are identified
and overcome. We observe that a common pattern of performance emerges. We show
that with proper planning, careful data collection and analysis, and adequate time for
technology updates, field performance can match that of the laboratory.
1 Introduction
It is generally believed that speech recognition accuracy drops when services
are deployed in the field. Experience from examples cited here verifies this belief,
at least for the initial deployment of new capabilities or services. Loss of accuracy
happens for an astonishing variety of reasons [1], many of which cannot be
anticipated. Examples of problems that emerge in the field include unexpected
user behavior, inappropriate handling of extraneous noise and speech, mismatch
between training data and speech encountered in the service, and failure to take
into account the range of errors that can be encountered with speech input [2].
Field issues can usually be resolved if time and resources are allocated for running
pre-service trials. Field iterations frequently give rise to a "slippery-slide" shaped
performance curve where error rates increase as the system moves from the lab to
the field and then taper off as problems are resolved and improvements are made.
In many cases, work to address real-world problems results in new invention that
actually drives error rates below those demonstrated in the laboratory.
This paper reviews deployment of an illustrative set of new services that use
speech technology. We cover a range of applications from simple isolated word
services to complex services that use natural language input. Since these systems
were designed for customers of Lucent Technologies, customer names and some
application details are omitted to protect proprietary interests. The earliest
services reviewed in this paper date back to the mid-'80s, and accuracy has
improved substantially from some of the figures shown here. Nevertheless, the
lessons learned still apply. We see that a well-designed service observation
process and provisions in the deployment schedule for measuring and correcting
field problems are essential for successful speech recognition system deployment.
2 Case Studies
We review ten examples of speech recognition in the field, beginning with an
isolated word recognition task and ending with a natural language application. We
focus on how field problems affected accuracy and how issues were resolved.
2.1 Voice Call Routing 1

In this first example, callers are prompted to say a single digit
corresponding to their call type. For example, if the caller reaches a bank,
the greeting may say "For accounts, say 'one.' For loans, say 'two.'" Our
customer had set an error rate requirement of 9% (of files not rejected). We
set an internal objective of a five percent false rejection rate. This figure
was chosen to achieve a desired false acceptance rate.

Figure 1. Call routing error rates. Mismatch between training data and
pre-service trials increases false rejection rates.
Figure 1 shows accuracy measured in four trial phases. This early service
demonstrates an effect that we have seen repeated many times since, that error
rates are often good in an initial trial, but rejection thresholds are easily thrown off
by a mismatch between training data and speech encountered in the real service. A
more serious complication was that the customer expected the 9% error rate target
to be met for each word in the six-digit vocabulary in each of 17 dialect
regions in the U.K., for a total of 108 test conditions. Although the overall error rate was 1.2%
in Phase 2, the word "two" in Northern Ireland was recognized as "three" about
30% of the time. We learned from field recordings that in that region, "two" is
pronounced "twoee," making it sound somewhat like "three." Increasing the
proportion of the problem word in our training data reduced its error rate to 6.3%.
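The interplay among the three figures used here (error rate over files not
rejected, false rejection, false acceptance) can be made concrete with a
small scoring routine. The sketch below is a hypothetical illustration with
invented outcome records, not the scoring tool used in these trials:

```python
# Hypothetical scoring sketch for the three metrics discussed above.
# The outcome records are invented; a real tally would come from
# transcribed field recordings.

from dataclasses import dataclass

@dataclass
class Outcome:
    valid: bool     # caller actually spoke an in-vocabulary digit
    rejected: bool  # recognizer rejected the utterance
    correct: bool   # accepted result matched what was said

def score(outcomes):
    accepted = [o for o in outcomes if not o.rejected]
    valid = [o for o in outcomes if o.valid]
    invalid = [o for o in outcomes if not o.valid]
    # Error rate "of files not rejected" (the customer's 9% requirement).
    error_rate = sum(not o.correct for o in accepted) / len(accepted)
    # Valid input wrongly rejected (the internal 5% objective).
    false_rejection = sum(o.rejected for o in valid) / len(valid)
    # Invalid input wrongly accepted.
    false_acceptance = sum(not o.rejected for o in invalid) / len(invalid)
    return error_rate, false_rejection, false_acceptance

# Toy tally: 90 correct, 5 valid-but-rejected, 3 noises rejected, 2 accepted.
outcomes = ([Outcome(True, False, True)] * 90 + [Outcome(True, True, False)] * 5
            + [Outcome(False, True, False)] * 3 + [Outcome(False, False, False)] * 2)
print(score(outcomes))  # ~2.2% error, ~5.3% false rejection, 40% false acceptance
```

Raising the rejection threshold trades false acceptances for false
rejections, which is why a mismatch between training data and field speech
shows up first as a jump in the rejection curve.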
2.2 Voice Call Routing 2
In a second call routing service, lab simulations were matched by field results
in Phase 3, as shown in Figure 2. However, the nature of the service made the
false acceptance rate of 4.5%, an otherwise good figure, unacceptable. Analysis of
recordings revealed that the recognizer was triggering on sounds such as breath
noise a small percentage of the time. We built a database of the offending
sounds and created a "breath" model, driving the false acceptance rate to 1%
[3].

Figure 2. Call routing error rates. False acceptance rates identified as
critical in Phase 3 are addressed in Phase 4.

2.3 Voice Activated Custom Calling

In an application designed to control features such as call forwarding, the
false rejection rate rose to 20% in Phase 1 (See Figure 3). The recognition
models had been built from speech recorded over digital lines, but the trial
was run in a rural Idaho community with very long local loops, creating a
mismatch between the models and the field environment. New models made from
recordings from this first phase fixed the problem. Other improvements
yielded final results superior to lab results [4].

Figure 3. Custom calling error rates. The "slippery-slide" shape of the
error curves is evident as the recognizer is tuned to a new environment.

2.4 Voice Name Dialing
We made a strategic error in an early voice name dialing service in not
building the capability to make recordings of the system in use. This
mistake forced us to rely on anecdotal accounts of service performance and
made accuracy tuning difficult. In a focus group, users reported accuracy in
the range of 30% to 75%, yet when we evaluated the system against a test
database (without tuning), we measured 89%. We eventually resolved the
problems with a great deal of effort, and have since developed a better
system, but we learned an important lesson: Never deploy speech recognition
without a good service observation capability.
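A basic service observation capability can be as simple as archiving every
utterance next to the recognizer's decision, so that accuracy can later be
measured from recordings rather than anecdotes. The sketch below is an
invented illustration of the idea; the recognize callable, result fields,
and file layout are assumptions, not the system deployed here:

```python
# Invented service-observation sketch: archive audio and the recognition
# outcome for every call so field accuracy can be scored offline.

import json
import time
import uuid
from pathlib import Path

LOG_DIR = Path("service_logs")  # hypothetical archive location

def observed_recognize(audio: bytes, recognize):
    """Run the recognizer, then log the utterance and its outcome."""
    result = recognize(audio)  # e.g. {"word": "two", "score": 0.87, "rejected": False}
    call_id = uuid.uuid4().hex
    LOG_DIR.mkdir(exist_ok=True)
    (LOG_DIR / f"{call_id}.pcm").write_bytes(audio)  # raw audio for relabeling
    record = {"time": time.time(), **result}
    (LOG_DIR / f"{call_id}.json").write_text(json.dumps(record))
    return result
```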
2.5 Automated Attendant 1
An application where callers were prompted to say one of five isolated words
was run first as a trial, then several years later as a real service. Results are shown
in Figures 4 and 5. In Phase 2A we found that 20% of the callers spoke the correct
keyword, but embedded it in a phrase. The first wordspotting algorithm was
subsequently developed and reduced the error rate to that shown in Phase 3A [5].
The importance of service observation is highlighted in Figure 5 where we were
not allowed to make recordings during Phase 1B and thus saw no improvement in
Phase 2B. Recordings from Phase 2B, combined with a series of technology
improvements, led to improved results in the final service.
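Wordspotting replaces whole-utterance matching with a search for the keyword
anywhere inside the surrounding speech. Purely as a conceptual sketch (this
is not the algorithm of [5]; the template-matching scheme and thresholds are
invented), the code below slides keyword templates across an utterance's
feature sequence and reports a keyword only when its best alignment clearly
beats a background score:

```python
# Conceptual wordspotting sketch: DTW-match each keyword template against
# windows of the utterance and accept only clear wins over a background
# (filler) score. Feature extraction (e.g. cepstra) is assumed done upstream.

import numpy as np

def dtw_cost(a: np.ndarray, b: np.ndarray) -> float:
    """Length-normalized DTW alignment cost between two feature sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

def spot(utterance: np.ndarray, templates: dict, bg_cost: float, margin: float = 0.2):
    """Return the keyword best embedded anywhere in `utterance`, or None."""
    best_word, best_cost = None, np.inf
    for word, tpl in templates.items():
        for start in range(len(utterance)):
            seg = utterance[start:start + 2 * len(tpl)]  # allow tempo variation
            if len(seg) < len(tpl) // 2:
                break
            cost = dtw_cost(seg, tpl)
            if cost < best_cost:
                best_word, best_cost = word, cost
    return best_word if best_cost < bg_cost - margin else None
```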
Figure 4. First set of the automated attendant trials. A wordspotting
algorithm is used to reduce error rates for embedded speech.

Figure 5. Further automated attendant trials and deployment. Failure to
make recordings holds error rates constant in Phase 2B.

2.6 Automated Attendant 2
Another automated attendant service with a 24-word greeting vocabulary
illustrates the danger in allowing an external organization to measure accuracy
without a procedure for verifying
the results. Our lab simulations easily met requirements, but an independent
evaluation by the customer showed our system failing rejection metrics (See
Figure 6). When we were allowed to review the study, we found it had been
hastily done. Results from our recognizer had been compared to inaccurate
transcriptions of field recordings. Once these and other errors were
corrected, the rates shown in the last phase were obtained.

Figure 6. Error rates for a 24-word automated attendant. Data mislabeling
by the customer inflates error measurements.
One issue dealt with in the field version that often plagues new speech recognition
systems is customer restarts, meaning the caller provides invalid input, pauses, and
then speaks a valid phrase. The invalid input can be a word, breath noise, or
background noise. The expectation is that the recognizer will reject the invalid
input and accept the valid phrase, but this is not as simple as it appears. First,
accurate out-of-vocabulary rejection is difficult and can result in rejection of
legitimate input. Second, the service application software must be intelligent
enough to distinguish spurious noises from a serious attempt to speak a correct
phrase. If the input is spurious, the recognizer should continue to listen. If it is a
failed attempt, the system should reprompt the caller. These distinctions are made
by a combination of timers, energy detectors, and recognition pass/fail criteria,
designed around input the system is likely to encounter.
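That decision logic is essentially a small state machine driven by timers,
an energy gate, and the recognizer's pass/fail result. The sketch below is a
hypothetical rendering of such logic with invented thresholds, not the
deployed service code:

```python
# Hypothetical restart handling: keep listening through spurious input,
# reprompt after a serious but failed attempt or a timeout, accept on a
# confident recognition. Thresholds are invented, not field-tuned values.

NO_INPUT_TIMEOUT_S = 5.0   # reprompt if the caller never speaks
MIN_SPEECH_ENERGY = 0.01   # energy gate: below this, treat input as spurious

def handle_turn(events):
    """`events` yields (elapsed_s, energy, recognition) tuples, where
    recognition is None or a (phrase, passed) pair from the recognizer."""
    for elapsed_s, energy, recognition in events:
        if recognition is not None:
            phrase, passed = recognition
            if passed:
                return ("accept", phrase)   # valid phrase, even after a restart
            if energy >= MIN_SPEECH_ENERGY:
                return ("reprompt", None)   # a serious attempt that failed
            # Low-energy rejection: likely breath or background noise,
            # so the recognizer should continue to listen.
        if elapsed_s > NO_INPUT_TIMEOUT_S:
            return ("reprompt", None)       # no-input timer expired
    return ("reprompt", None)
```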
2.7 Wireless Digit Dialing
In our first experiment with voice dialing of telephone numbers from mobile
phones we did not have wireless recognition models and were forced to use landline models in Phase 1 while we waited for the wireless data collection to finish.
(See Figure 7. Replacement errors and rejection errors are added together.) Data
collection improved the results in Phase 2. We were surprised to learn that, at this
point, users were delighted with the service, despite accuracy that was still poor
from our point of view.
Even when people had to repeat the number two and three times to get the
entire 7-digit string correct, they felt that the convenience of not having
to dial while driving was worth it. Accuracy was further improved in Phase 3
by building models using recordings from actual service instead of from the
data collection. In addition, we discovered that some users were using
mobile speakerphones (against our strong advice), so we folded speakerphone
data into Phase 3. Optimization and hardware improvements gave the results
in Phase 4, though the accuracy falls far short of that of our latest (1997)
wireless recognizers.

Figure 7. Wireless digit dialing accuracy. Per-digit error is high with
wired models, but drops as field data is used to build new models and
algorithms.
2.8 ATM Speaker Verification
One candidate for speaker verification technology is confirming the identity
of ATM (Automated Teller Machine) users. In a trial conducted in cooperation
with a large bank, we had seen about a 5% EER (Equal Error Rate) in the lab for
4-digit, randomly prompted tasks. However, when the bank began to measure
performance, they reported that the system was failing 40% of the time.
Fortunately, we had made recordings and found that 7% of these failures were
"imposters" - people experimenting with the service to see if they could break into
Error Rate
someone else's account – and 28% were people failing to speak within the
allocated time window. Once we eliminated these data points, the measured error
rate dropped to 11%, as shown in Figure 8, Phase 1. Tuning and other
improvements cut that figure in half. It is interesting to note that another test
database collected at the same time with the same task yielded dramatically
different results, with a final EER of only 2.4% [6]. This shows how results
can be highly dependent on the test data.

Figure 8. ATM speaker verification errors measured by the bank included
user errors.
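For reference, EER is the operating point at which the false acceptance and
false rejection rates are equal as the verification threshold sweeps. A
minimal sketch of that computation, run here on invented score lists rather
than trial data:

```python
# Minimal EER sketch: sweep a decision threshold over verification scores
# and report the point where false acceptance equals false rejection.

import numpy as np

def equal_error_rate(genuine: np.ndarray, imposter: np.ndarray) -> float:
    """Scores are similarity values; higher means more likely the true speaker."""
    best_gap, eer = 1.0, 0.5
    for t in np.sort(np.concatenate([genuine, imposter])):
        fr = np.mean(genuine < t)    # true speakers rejected at threshold t
        fa = np.mean(imposter >= t)  # imposters accepted at threshold t
        if abs(fr - fa) < best_gap:
            best_gap, eer = abs(fr - fa), (fr + fa) / 2
    return eer

# Invented example scores: well-separated populations give a low EER.
rng = np.random.default_rng(0)
print(equal_error_rate(rng.normal(2.0, 1.0, 500), rng.normal(0.0, 1.0, 500)))
```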
2.9 Voice Calling Card

One common reason lab results do not match field results is that lab numbers
often ignore rejection errors.
Accuracy, for example, may be quoted to be in the mid-90s, yet in a real service,
some of the valid input is rejected even if it would otherwise have been correctly
recognized. This is an unavoidable result of the recognizer's attempt to reject the
invalid input that is always present for some fraction of users in a real service.
In a previous large-scale connected-digit service, we had measured a string error
rate of 5.9% on 10-digit strings, but based new requirements on the assumption
that an additional 5% of the valid strings would be rejected. To leave a margin for
testing error and environmental uncertainties, requirements were set at 16.2%.
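Spelling out that requirement arithmetic, under the assumption (as in the
text) that recognition errors and rejections of valid strings simply add:

```python
# Toy requirement arithmetic for the calling card service. The additive
# combination of error and rejection rates mirrors the reasoning in the text.

lab_string_error = 0.059    # measured 10-digit string error rate in the lab
expected_rejection = 0.050  # assumed fraction of valid strings rejected
field_estimate = lab_string_error + expected_rejection   # 10.9%

requirement = 0.162         # target chosen to leave a safety margin
margin = requirement - field_estimate                    # ~5.3 points
print(f"estimate {field_estimate:.1%}, requirement {requirement:.1%}, margin {margin:.1%}")
```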
Figure 9. Calling card digit string accuracy. Tally sheets are an unreliable
method for measuring performance.
This figure sounds like an easy target, but our first measurement of field accuracy
showed a dismal 44.2% error rate, based on a user survey of people calling the
service and keeping score on tally sheets. We have learned that manual reporting
of accuracy is nearly always wrong by a wide margin. The only way to reasonably
evaluate performance is to analyze recordings. Figure 9 shows that the actual error
rate in Phase 1 was less than half of the tally sheet result. A series of algorithmic
improvements brought the final error rate to under 10%. Most of the
remaining errors came from phones with high levels of distortion or noise,
or from speakers who were otherwise difficult to understand.
2.10 Natural Language Movie Locator
Although most of the profitable applications currently in use receive simple
input from customers, our newest systems understand thousands of words and
complete sentences. The objective of these natural language systems is to
eliminate prompting menus and allow free-form input from callers.
As one test case for this technology we ran a series of trials on a movie locator
system that would allow a user to ask questions such as "Where is Men in
Black playing near Wheaton?" or "Where is the Ogden 6 Theater?"

Figure 10. Error rates for a natural language movie locator service.
Switching to a new recognizer (Phases 6 & 7) and enhancing the grammar
yielded high accuracy.
In our initial attempt (Phase 1 in Figure 10), the rejection sensitivity was set too
high. By lowering this parameter and by adding new grammar options based on
questions people were actually asking, we cut the error rate almost in half. Further
enhancements through Phase 5 cut the error rate in half again. One improvement
was to include multiple pronunciations for some vocabulary words. For example,
the word "when" was represented by two pronunciations, "when" and "whin,"
based on the way people actually said the word, allowing the recognizer to
cope with variation in caller pronunciation. Use of a completely new
recognizer (after some initial tuning in Phase 6) cut the error rate in half again.
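Supporting pronunciation variants like "when"/"whin" is typically a lexicon
change rather than an acoustic model change: each word lists several phone
strings, and the recognizer scores them all. The sketch below is an invented
illustration of that idea, not the movie locator's actual lexicon format:

```python
# Invented pronunciation lexicon sketch: each word maps to alternate phone
# strings; the recognizer keeps whichever variant the audio matches best.
# Phone symbols are ARPAbet-like and purely illustrative.

LEXICON = {
    "when": [("w", "eh", "n"),    # standard pronunciation
             ("w", "ih", "n")],   # "whin", as heard in field recordings
    "where": [("w", "eh", "r")],
}

def best_variant(word, acoustic_score):
    """Pick the variant with the highest acoustic score for this utterance.
    `acoustic_score(phones)` stands in for the recognizer's scoring pass."""
    return max(LEXICON[word], key=acoustic_score)

# Toy usage with a stand-in scorer that prefers the "ih" variant:
print(best_variant("when", lambda p: 1.0 if "ih" in p else 0.5))
```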
3 Conclusions
As we study successful deployments of speech recognition, we see a familiar
pattern emerge where accuracy initially degrades, then improves in the process of
conducting field experiments. Obtaining field results comparable to lab results
requires willingness to conduct several field iterations and a good service
observation system. We see that field performance can be brought to a level that is
acceptable to customers and that speech recognition is a viable option for real
services in the telephone network.
Acknowledgments
The author wishes to thank John Jacob, Anand Setlur, Rafid Sukkar, and Jack
Wisowaty for providing some of the previously unpublished data for this review.
References
[1] D. L. Thomson, "Looking For Trouble: Planning for the Unexpected in
Speech Recognition Services," IEC Annual Review of Communications, vol. 50,
pp. 1089-1093, IEC, 1997.

[2] B. H. Juang, R. J. Perdue, Jr., and D. L. Thomson, "Deployable Automatic
Speech Recognition Systems: Advances and Challenges," AT&T Technical
Journal, vol. 74, no. 2, March/April 1995, pp. 45-5.

[3] D. J. Krasinski and R. A. Sukkar, "Automatic Speech Recognition for
Network Call Routing," Proc. IVTTA 94 (Second IEEE Workshop on Interactive
Voice Technology for Telecommunications Applications), pp. 157-160, Sept.
1994.

[4] K. V. Kinder, S. Fox, and G. Batcha, "Accessing Telephone Services Using
Speech Recognition: Results From Two Field Trials," Proc. AVIOS (Annual
Voice I/O Systems), pp. 83-89, Sept. 1993.

[5] R. W. Bossemeyer, J. G. Wilpon, C. H. Lee, and L. R. Rabiner, "Automatic
speech recognition of small vocabularies within the context of
unconstrained input," (abstract) J. Acoust. Soc. Am., Suppl. 1, 84, 1988.

[6] T. E. Jacobs and A. K. Setlur, "A Field Study of Performance
Improvements in HMM-Based Speaker Verification," Proc. IVTTA (IEEE Workshop
on Interactive Voice Tech. for Telecom. Appl.), pp. 121-124, 1994.