Review of methods for evaluating synthetic speech

RNIB Centre for Accessible Information (CAI)
Technical report #8
Published by:
RNIB Centre for Accessible Information (CAI), 58-72 John Bright
Street, Birmingham, B1 1BN, UK
Commissioned by:
As publisher
Authors:
(Note: After corresponding author, authors are listed alphabetically,
or in order of contribution)
Heather Cryer* and Sarah Home
*For correspondence
Tel: 0121 665 4211
Email: heather.cryer@rnib.org.uk
Date: 17 February 2010
Document reference: CAI-TR8 [02-2010]
Sensitivity: Internal and full public access
Copyright: RNIB 2010
Citation guidance:
Cryer, H., and Home, S. (2010). Review of methods for evaluating
synthetic speech. RNIB Centre for Accessible Information,
Birmingham: Technical report #8.
Acknowledgements:
Thanks to Sarah Morley Wilkins for support in this project.
Table of contents
Introduction
Objective/Acoustic measures
User testing
   Performance measures
   Opinion measures
Feature comparisons
Conclusion
References
Introduction
This paper provides background to the development of an
evaluation protocol for synthetic voices. The aim of the project
was to develop a protocol which could be used by staff working
with synthetic voices to conduct systematic evaluations and keep
useful records detailing their findings.
In order to do this, a review of existing literature was carried out to
determine the various methods used for evaluating synthetic
voices. This paper synthesises findings of the literature review.
There are a number of different approaches to evaluating synthetic
voices. The type of evaluation carried out is likely to depend on
the purpose of the evaluation, and the specific application for
which the synthetic voice is intended. The key approaches to
evaluation discussed here are:
• Objective measures
• User testing (performance measures and subjective measures)
• Feature comparisons
Whilst these approaches may seem very different, in fact they
complement each other, and different types of evaluation may be
carried out at different stages in the development and use of a
synthetic voice (Francis and Nusbaum, 1999). For example,
objective/acoustic measures are used particularly by developers
as diagnostic tools to refine the voice, ensuring it is as good as
possible. Similarly, testing user performance in listening to the
voice enables further development to make improvements.
Subjective user testing is more useful for someone considering the
voice for use in a product or service, to find out whether users are
happy with the voice. Similarly, feature comparisons are useful to
those considering investing in a voice for their application, to
evaluate whether the voice has the desired features and whether it
fits in with existing systems and processes.
The following review aims to give an overview of these
approaches, explaining when they may be used, advantages and
disadvantages of each approach, and providing some insight into
the complexity of the process.
Objective/Acoustic measures
Historically, synthetic speech evaluations tended to focus on
whether or not the voice was intelligible. As technology has
progressed, synthetic voices have become more sophisticated.
This means that intelligibility is generally a given, so the focus of
evaluations has shifted towards how closely a synthetic voice
mirrors a human voice (Campbell, 2007; Francis and Nusbaum,
1999; Morton, 1991). This is partly based on subjective user
evaluation (discussed more later), but is also deemed important in
an objective sense, in terms of measuring whether or not an
utterance from a synthetic voice acoustically matches the same
utterance in human speech.
This type of objective/acoustic measure forms a large part of
synthetic speech evaluation, particularly for developers of synthetic
voices. By measuring where voices differ from a human utterance,
problem areas can be identified which can then be refined to make
improvements.
Objective/acoustic measures are beneficial for a number of
reasons. Firstly, by their very nature, objective measures offer a
clear measurement of how a voice is performing, and can be used
to diagnose problem areas needing development. Also, compared
to the time and cost involved in user testing, objective measures
can be an efficient form of evaluation (Hirst, Rilliard and
Aubergé, 1998; Clark and Dusterhoff, 1999).
A drawback of objective/acoustic evaluation is that findings from such
evaluations may not always match up with listener perceptions.
Researchers have found that some objective measures may be
over-sensitive compared to the human ear.
Some studies show differences between natural and synthetic
voices (highlighted by objective measures) which were not
perceived by users (Clark and Dusterhoff, 1999). The opposite
may also be true, in that synthetic samples may be perfect in
acoustic terms, but may be perceived as unnatural by listeners.
Morton (1991) explains this based on the processing required to
listen to speech. Morton suggests that human speakers naturally
vary their speech, with the aim of being understood, and that this
affects the way human speech is processed by listeners. As a
synthesiser does not have this capability, the processing needs of
the listener are not accounted for. This may cause the listener to
find the synthetic speech unnatural or difficult to understand.
Objective/acoustic measures can be used in evaluation of different
aspects of a voice, such as prosodic features or intonation.
Examples of acoustic measures which may be of interest include
fundamental frequency, segmental duration and intensity. A
common approach is to use a statistical measure such as Root Mean
Squared Error (RMSE) to compare the actual performance of the
voice against its expected performance, measuring the accuracy of
the synthetic voice against a natural voice.
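To make this concrete, below is a minimal sketch of how such an RMSE comparison of fundamental frequency (F0) contours might look. It is an illustration rather than a method from the literature reviewed here: the function name and the sample values are invented, and it assumes both contours have already been extracted and time-aligned (for example, sampled every 10 ms).

```python
import numpy as np

def f0_rmse(natural_f0, synthetic_f0):
    """RMSE between two time-aligned fundamental frequency (F0) contours.

    Both inputs are sequences of F0 values in Hz, sampled at the same
    fixed time points. Lower is better: 0 would mean the synthetic
    contour matches the natural one exactly.
    """
    natural = np.asarray(natural_f0, dtype=float)
    synthetic = np.asarray(synthetic_f0, dtype=float)
    if natural.shape != synthetic.shape:
        raise ValueError("contours must be time-aligned to the same length")
    return float(np.sqrt(np.mean((synthetic - natural) ** 2)))

# Invented example contours, one value per 10 ms frame
natural = [120.0, 135.0, 150.0, 140.0, 125.0]
synthetic = [118.0, 140.0, 145.0, 142.0, 120.0]
print(f"F0 RMSE: {f0_rmse(natural, synthetic):.2f} Hz")
```

Note that this compares the contours frame by frame along the time axis, which is exactly the limitation of the standard RMSE approach discussed below.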
Whilst objective measures have been described as more efficient
than user testing, of course specialist knowledge is required in
order to run these tests. Clark and Dusterhoff (1999) report on
trials of three different objective measures aimed at investigating
differences in intonation, and highlight some of the complexities
involved. The generally accepted measure - a statistical technique
known as Root Mean Squared Error - measures the difference
between sound contours on a time axis. A drawback of this
method is that it measures differences between contours at
particular time points, rather than at "important" events (e.g.
changes in pitch). Furthermore, differences between two
utterances could occur due to timing, due to pitch or due to a
combination of these factors, and it is difficult to combine
measurements of these two different factors.
Clark and Dusterhoff trialled two alternative measures, designed to
allow for the idea that differences could be due to various factors,
and to focus on important pitch events. The measures were tested
alongside subjective user perceptions of differences between
utterances. Findings of the trials showed that whilst the new
measures were effective in identifying differences between the
synthetic and natural utterances, they did not add anything beyond
the commonly used RMSE. These findings highlight the
complexity of measurements in this area.
Another possible improvement to objective measures, suggested
by Hirst et al (1998) is to use a corpus-based evaluation
technique, comparing the synthetic utterance to a range of natural
reference utterances, rather than just one. The authors suggest
that this would improve reliability of data.
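As a sketch of how this corpus-based idea might work, the snippet below (reusing the hypothetical f0_rmse function from the earlier example) scores one synthetic contour against several natural references; the use of the mean as the summary statistic is an illustrative assumption.

```python
def corpus_f0_rmse(synthetic_f0, reference_contours):
    """Score one synthetic contour against several natural references.

    Averaging over multiple natural renditions of the same utterance
    smooths over speaker-to-speaker variability, rather than treating
    a single recording as the one "correct" target.
    """
    scores = [f0_rmse(ref, synthetic_f0) for ref in reference_contours]
    return sum(scores) / len(scores)
```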
In summary, objective measures can be a useful means of
evaluation for synthetic voices, and can be a very efficient form of
testing. However, such techniques require specialist knowledge
and measures are still in development.
User testing
Evaluations of synthetic voices often involve testing the voice with
those who are ultimately going to use it. Testing with users is
beneficial for understanding how the voice will work in a particular
application (Braga, Freitas and Barros, 2002). There are a
variety of approaches to user testing, falling into two broad
categories - performance measures and opinion measures.
Performance measures (such as intelligibility) give information
about whether the voice is sufficiently accurate for people to
understand it; and opinion measures (such as acceptability) give
information on users' subjective judgements of the voice.
User testing in the evaluation of synthetic voices has various
benefits. As a 'real world' test of the voice with those who will use
it, such testing gives insight into whether the voice can be
understood, and whether it is accepted. It could be argued that
this is all that matters when testing a voice, as if end users can use
and understand it, the technical accuracy of the voice may not be
important. Indeed, Campbell (2007) suggests that users are more
interested in the 'performance' of the voice (such as how pleasant
it is, how suitable for the context and so on) than they are in the
technical success of the developers. By testing with users, real
world reactions to a voice can be gathered, which can be
particularly informative for plans of future products or services.
Of course there are downsides to user testing too. It is time
consuming and expensive to organise (Clark and Dusterhoff,
1999). Also, as individuals differ in both ability and opinion, it can
be difficult to draw conclusions from diverse user data.
Performance measures
Performance measures test listeners' "reading performance" when
reading with a synthetic voice. This is a way of testing how well
the voice conveys information, or its intelligibility. There are a
number of intelligibility tests, which differ on various factors. For
example, some have closed answers (where listeners select what
they heard from multiple choice options), and others have free
answers (where listeners simply report what they think they heard).
Furthermore, some tests measure intelligibility at phoneme level
(testing whether listeners can tell the difference between sounds)
whilst others test intelligibility at word level (evaluating listeners'
ability to understand words). Testing with longer pieces of text
(whole sentences) also allows evaluation of prosody, the non-verbal
aspects of speech such as intonation, rhythm and pauses
(Benoît, Grice and Hazan, 1996).
To demonstrate some of these differences, two commonly used
tests will be discussed in detail. These are the diagnostic rhyme
test (DRT) and the semantically unpredictable sentences test
(SUS).
Diagnostic Rhyme Test (DRT)
The diagnostic rhyme test is a closed answer test which tests
intelligibility at phoneme level. Listeners are presented with
monosyllabic words which differ only in the first consonant, and
have to choose which word they have heard from pairs (for
example, pin/fin, hit/bit).
Braga et al (2002) discuss the complexity of constructing the test
set of word pairs for the DRT. The test set should reflect common
syllabic trends in the language, considering likelihood of the
consonant appearing at the beginning or the end of a word.
Furthermore, pairs should be constructed so as to enable testing
of the intelligibility of a variety of phonetic features. For the English
version of the test, these features are voicing, nasality, sustension,
sibilation, graveness and compactness (these features relate to
things like the vibration of the vocal cords, the position of the
tongue and lips and so on).
Results from the DRT can be evaluated in a variety of ways, from
simply looking at the number of correct responses, to analysing
confusion between particular phonetic features. This is useful as it
gives not only an overall impression of intelligibility but can also
identify areas where confusions occur (Braga et al, 2002).
Furthermore, the test can be carried out with a variety of voices
and performance easily compared.
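As an illustration of this kind of scoring, a minimal sketch follows, tallying overall percent correct along with a per-feature breakdown. The trials shown are invented and do not form a valid DRT test set.

```python
from collections import defaultdict

# Each trial records the phonetic feature the word pair contrasts, the
# word that was played, and the word the listener chose. All invented.
trials = [
    {"feature": "voicing",    "played": "pin",  "chosen": "pin"},
    {"feature": "voicing",    "played": "bit",  "chosen": "pit"},
    {"feature": "nasality",   "played": "moot", "chosen": "moot"},
    {"feature": "sibilation", "played": "sing", "chosen": "thing"},
]

correct = defaultdict(int)
total = defaultdict(int)
for t in trials:
    total[t["feature"]] += 1
    if t["chosen"] == t["played"]:
        correct[t["feature"]] += 1

overall = 100 * sum(correct.values()) / len(trials)
print(f"Overall: {overall:.0f}% correct")
for feature in total:
    print(f"  {feature}: {100 * correct[feature] / total[feature]:.0f}% correct")
```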
Semantically Unpredictable Sentences (SUS)
The Semantically Unpredictable Sentences (SUS) test is a free
answer test evaluating intelligibility at word level. Listeners are
presented with sentences which are syntactically normal, but
semantically abnormal. That is, the sentences use words of the right
grammatical class, but in combinations which may not make sense, for
example "She ate the sky". Listeners are presented with the sentence and
asked to write down what they think they heard.
Using semantically unpredictable sentences allows evaluators to
test intelligibility of words without the context of the sentence
cueing listeners to expect a particular word. Furthermore, it is
possible to generate a huge number of these sentences, which
reduces the need to re-use sentences (which could cause learning
effects). Benoît et al (1996) discuss the process of constructing a
test set for the SUS. They recommend using a variety of sentence
structures. There are various rules for inclusion in each set, and a
computer is used to randomly generate sentences based on
frequency of use of words. Benoît et al (1996) suggest the use of
monosyllabic words (the shortest available words in their class) to
further reduce contextual cues. They outline a procedure for
running the SUS, suggesting that a consistent approach across
tests will make it easier (and more accurate) to compare
performance across different synthetic voices.
Results from the test consist of the percentage of correctly identified
sentences, both across the full set and for each sentence structure.
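The toy sketch below illustrates both steps: assembling a syntactically normal but semantically unpredictable sentence from word-class lists, and scoring exact transcriptions as a percentage. The word lists and the single sentence structure are invented; a real test set would follow the construction and frequency rules of Benoît et al (1996), and per-structure percentages would simply group the scores by template.

```python
import random

# Invented word-class lists; a real SUS set would be built from
# word-frequency data and use several sentence structures.
WORDS = {
    "det":  ["the"],
    "adj":  ["green", "loud", "flat", "near"],
    "noun": ["sky", "spoon", "law", "hill"],
    "verb": ["ate", "sang", "drew", "paid"],
}

# One illustrative structure: det adj noun verb det noun
STRUCTURE = ["det", "adj", "noun", "verb", "det", "noun"]

def generate_sus_sentence():
    """Randomly fill each slot, giving a grammatical but senseless sentence."""
    return " ".join(random.choice(WORDS[slot]) for slot in STRUCTURE)

def percent_correct(presented, transcribed):
    """Share of sentences written down exactly as presented, as a percentage."""
    matches = sum(p.lower() == t.strip().lower()
                  for p, t in zip(presented, transcribed))
    return 100 * matches / len(presented)

print(generate_sus_sentence())  # e.g. "the loud law drew the hill"
```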
It must be noted that whilst the SUS is useful for comparing
intelligibility of different systems, setting up the procedure is
complex and must be done carefully to ensure results are
comparable (Benoît et al, 1996).
Opinion measures
User testing with synthetic voices is not just about how well the
listener can understand the voice, but also the listener's opinion of
the voice. A widely used opinion measure in evaluations of
synthetic voices is naturalness, that is, the extent to which the
voice sounds human. The most common test is the Mean Opinion
Score (MOS). This test involves a large number of participants
who listen to a set of sentences presented in synthetic speech and
rate them on a simple 5-point scale (excellent to bad). Scores are
then averaged across the group (Francis and Nusbaum, 1999).
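A minimal sketch of that calculation, assuming ratings are coded 1 (bad) to 5 (excellent); all figures are invented.

```python
# One row per test sentence; each number is one listener's 1-5 rating.
ratings = [
    [4, 3, 5, 4, 4],  # sentence 1
    [3, 3, 4, 2, 3],  # sentence 2
    [5, 4, 4, 4, 5],  # sentence 3
]

sentence_means = [sum(row) / len(row) for row in ratings]
mos = sum(sentence_means) / len(sentence_means)
print(f"Mean Opinion Score: {mos:.2f} out of 5")
```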
Generally speaking, more natural-sounding voices are more
accepted (Stevens, Lees, Vonwiller and Burnham, 2005).
However, research suggests that this may depend on context.
Francis and Nusbaum (1999) report on a study in which users
preferred an unnatural sounding voice in the context of telephone
banking, as they reasoned they would rather a computer knew
their bank balance than a real person.
Whilst naturalness is a widely used measure, some researchers
suggest it is not ideal. Campbell (2007) suggests that 'believability'
may be preferable, as in some cases a voice does not need to sound
natural (for example, in cartoons). Other measures used include
acceptability or likeability (Francis and Nusbaum, 1999).
Whilst many people may have preconceptions about synthetic
speech which may make them reluctant to use it, research
suggests that people tend to get used to it. Hjelmquist, Jansson
and Torrell (1990) found this to be the case with blind and partially
sighted people reading synthetic daily newspapers. Whilst it took
participants some time to become accustomed to the synthesised
speech, over time they reported it to be good quality. Indeed,
research with performance measures also shows an improvement
in understanding of synthetic voices following practice
(Venkatagiri, 1994; Rhyne, 1982).
Feature comparisons
A common way in which people carry out evaluations is to simply
compare products on a list of desired features or functionality.
Whilst this is not a formal testing procedure discussed in the
literature, it is acknowledged that all systems have their own
strengths and weaknesses and sometimes trade-offs need to be
made depending on what is most important for a particular
application (Campbell, 2007). Indeed, the feature comparison
approach could also be described as an 'adequacy
evaluation' (Furui, 2007). Furui describes an adequacy evaluation
as "determining the fitness of a system for a purpose: does it meet
the requirements, and if so, how well and at what cost?" (p.23).
This method of evaluation may well be the most common form of
evaluating synthetic voices. As such a test does not require
technical expertise or complex set up, it is accessible to anyone
working with synthetic voices. It would be beneficial to have a
standardised procedure for how to carry out such evaluations, to
act as a guide.
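In the absence of such a standard, one simple way to record a feature comparison is as a weighted requirements checklist, sketched below; the features, weights and voice names are placeholders, not recommendations.

```python
# Hypothetical requirements with weights reflecting their importance
# to a particular application; all names and numbers are placeholders.
REQUIREMENTS = {
    "uk_english_voice": 3,
    "runs_on_existing_server": 3,
    "supports_ssml": 2,
    "adjustable_speaking_rate": 2,
}

CANDIDATE_VOICES = {
    "Voice A": {"uk_english_voice": True, "runs_on_existing_server": True,
                "supports_ssml": True, "adjustable_speaking_rate": False},
    "Voice B": {"uk_english_voice": True, "runs_on_existing_server": True,
                "supports_ssml": False, "adjustable_speaking_rate": True},
}

max_score = sum(REQUIREMENTS.values())
for name, features in CANDIDATE_VOICES.items():
    score = sum(weight for feature, weight in REQUIREMENTS.items()
                if features.get(feature))
    print(f"{name}: meets {score}/{max_score} weighted requirements")
```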
Conclusion
This paper has outlined a range of different methods for evaluating
synthetic voices. Different methods are likely to be used by
different people, at different stages in the development of a
synthetic voice product. Each method is useful in its own way, and
it is important to consider the purpose of the evaluation when
choosing which evaluation method to use. Some methods require
technical expertise or expert knowledge, whereas others can be
used by anyone.
This paper supports the development of a protocol for evaluating
synthetic voices, which aims to outline a simple, easy-to-use
evaluation procedure. The protocol combines
a range of evaluation methods and allows those carrying out
evaluations to pick and choose which aspects of the voice are
important to their evaluation.
The synthetic speech evaluation protocol is available on RNIB's
website at
http://www.rnib.org.uk/professionals/accessibleinformation/accessibleformats/audio/speech/Pages/synthetic_speech.aspx
References
Benoît, C., Grice, M., and Hazan, V. (1996). The SUS test: a
method for the assessment of text-to-speech synthesis intelligibility
using Semantically Unpredictable Sentences. Speech
Communication, 18, 381-392.
Braga, D., Freitas, D., and Barros, M.J. (2002). A DRT approach
for subjective evaluation of intelligibility in European Portuguese
Synthetic Speech. International Conference on Systems Science
(ICOSYS 2002), October 21-24, 2002, Rio de Janeiro, Brazil.
Campbell, N. (2007). Evaluation of Speech Synthesis. In L.
Dybkjær, H. Hemsen and W. Minker (Eds.), Evaluation of Text and
Speech Systems. The Netherlands: Springer, p29-64.
Clark, R.A.J., and Dusterhoff, K.E. (1999). Objective methods for
evaluating synthetic intonation. In Proceedings of Eurospeech
1999, Volume 4, p1623-1626.
Francis, A.L., and Nusbaum, H.C. (1999). Evaluating the quality of
synthetic speech. In D. Gardner-Bonneau (Ed.), Human Factors
and Voice Interactive Systems. Boston: Kluwer Academic, p63-97.
Furui, S. (2007). Speech and speaker recognition evaluation. In
L. Dybkjær, H. Hemsen and W. Minker (Eds.), Evaluation of Text
and Speech Systems. The Netherlands: Springer, p1-27.
Hirst, D., Rilliard, A., and Aubergé, V. (1998). Comparison of
subjective evaluation and an objective evaluation metric for
prosody in text-to-speech synthesis. In ESCA Workshop on
Speech Synthesis, Jenolan Caves, Australia, 1998.
Hjelmquist, E., Jansson, B., and Torrell, G. (1990). Computer-oriented
technology for blind readers. Journal of Visual
Impairment and Blindness, 17, 210-215.
Morton, K. (1991). Expectations for Assessment Techniques
Applied to Speech Synthesis. Proceedings of the Institute of
Acoustics, vol. 13, Part 2.
Rhyne, J.M. (1982). Comprehension of synthetic speech by blind
children. Journal of Visual Impairment and Blindness, 10 (10),
313-316.
Stevens, C., Lees, N., Vonwiller, J., and Burnham, D. (2005).
Online experimental methods to evaluate text-to-speech (TTS)
synthesis: effects of voice gender and signal quality on
intelligibility, naturalness and preference. Computer Speech and
Language, 19, 129-146.
Venkatagiri, H.S. (1994). Effect of sentence length and exposure
on the intelligibility of synthesized speech. Augmentative and
Alternative Communication, 10, 96-104.