RNIB Centre for Accessible Information (CAI)
Technical report #8

Review of methods for evaluating synthetic speech

Published by: RNIB Centre for Accessible Information (CAI), 58-72 John Bright Street, Birmingham, B1 1BN, UK
Commissioned by: As publisher
Authors: (Note: after the corresponding author, authors are listed alphabetically, or in order of contribution) Heather Cryer* and Sarah Home
*For correspondence. Tel: 0121 665 4211. Email: heather.cryer@rnib.org.uk
Date: 17 February 2010
Document reference: CAI-TR8 [02-2010]
Sensitivity: Internal and full public access
Copyright: © RNIB 2010
Citation guidance: Cryer, H., and Home, S. (2010). Review of methods for evaluating synthetic speech. RNIB Centre for Accessible Information, Birmingham: Technical report #8.
Acknowledgements: Thanks to Sarah Morley Wilkins for support in this project.

Prepared by: Heather Cryer (Research Officer, CAI)
FINAL version, 17 February 2010

Table of contents
- Introduction
- Objective/Acoustic measures
- User testing
  - Performance measures
  - Opinion measures
- Feature comparisons
- Conclusion
- References

Introduction

This paper provides background to the development of an evaluation protocol for synthetic voices.
The aim of the project was to develop a protocol which could be used by staff working with synthetic voices to conduct systematic evaluations and keep useful records detailing their findings. To this end, a review of existing literature was carried out to determine the various methods used for evaluating synthetic voices. This paper synthesises the findings of that review.

There are a number of different approaches to evaluating synthetic voices. The type of evaluation carried out is likely to depend on the purpose of the evaluation and the specific application for which the synthetic voice is intended. The key approaches discussed here are:

- Objective measures
- User testing (performance measures and subjective measures)
- Feature comparisons

Whilst these approaches may seem very different, in fact they complement each other, and different types of evaluation may be carried out at different stages in the development and use of a synthetic voice (Francis and Nusbaum, 1999). For example, objective/acoustic measures are used particularly by developers as diagnostic tools to refine the voice, ensuring it is as good as possible. Similarly, testing user performance in listening to the voice enables further development to make improvements. Subjective user testing is more useful for someone considering the voice for use in a product or service, to find out whether users are happy with the voice. Likewise, feature comparisons are useful to those considering investing in a voice for their application, to evaluate whether the voice has the desired features and whether it fits in with existing systems and processes.

The following review aims to give an overview of these approaches, explaining when each may be used and its advantages and disadvantages, and providing some insight into the complexity of the process.
Objective/Acoustic measures

Historically, synthetic speech evaluations tended to focus on whether or not the voice was intelligible. As technology has progressed, synthetic voices have become more sophisticated. This means that intelligibility is generally a given, so the focus of evaluations has shifted towards how closely a synthetic voice mirrors a human voice (Campbell, 2007; Francis and Nusbaum, 1999; Morton, 1991). This is partly based on subjective user evaluation (discussed later), but is also deemed important in an objective sense: measuring whether or not an utterance from a synthetic voice acoustically matches the same utterance in human speech.

This type of objective/acoustic measure forms a large part of synthetic speech evaluation, particularly for developers of synthetic voices. By measuring where voices differ from a human utterance, problem areas can be identified and refined to make improvements.

Objective/acoustic measures are beneficial for a number of reasons. Firstly, by their very nature, objective measures offer a clear measurement of how a voice is performing, and can be used to diagnose problem areas needing development. Also, compared to the time and cost involved in user testing, objective measures can be an efficient form of evaluation (Hirst, Rilliard and Aubergé, 1998; Clark and Dusterhoff, 1999).

The drawback of objective/acoustic evaluation is that findings from such evaluations may not always match up with listener perceptions. Researchers have found that some objective measures may be oversensitive compared to the human ear. Some studies show differences between natural and synthetic voices (highlighted by objective measures) which were not perceived by users (Clark and Dusterhoff, 1999). The opposite may also be true: synthetic samples may be perfect in acoustic terms, yet be perceived as unnatural by listeners.
Morton (1991) explains this in terms of the processing required to listen to speech. Morton suggests that human speakers naturally vary their speech with the aim of being understood, and that this affects the way human speech is processed by listeners. As a synthesiser does not have this capability, the processing needs of the listener are not accounted for. This may cause the listener to find the synthetic speech unnatural or difficult to understand.

Objective/acoustic measures can be used to evaluate different aspects of a voice, such as prosodic features or intonation. Examples of acoustic measures which may be of interest include fundamental frequency, segmental duration and intensity. A common approach is to use a statistical method (Root Mean Squared Error, RMSE) to compare the expected performance of the voice against its actual performance, measuring the accuracy of the synthetic voice against a natural voice. Whilst objective measures have been described as more efficient than user testing, specialist knowledge is of course required in order to run these tests.

Clark and Dusterhoff (1999) report on trials of three different objective measures aimed at investigating differences in intonation, and highlight some of the complexities involved. The generally accepted measure, RMSE, measures the difference between sound contours on a time axis. A drawback of this method is that it measures differences between contours at particular time points, rather than at "important" events (e.g. changes in pitch). Furthermore, differences between two utterances could occur due to timing, due to pitch, or due to a combination of these factors, and it is difficult to combine measurements of these two different factors. Clark and Dusterhoff trialled two alternative measures, designed to allow for the idea that differences could be due to various factors, and to focus on important pitch events.
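To make the RMSE approach concrete, the following minimal sketch compares a synthetic fundamental frequency (F0) contour against a natural reference sampled at the same time points. The contour values are invented for illustration and are not taken from any cited study.

```python
import math

def contour_rmse(natural_f0, synthetic_f0):
    """Root Mean Squared Error between two F0 contours (in Hz),
    sampled at the same fixed time points."""
    if len(natural_f0) != len(synthetic_f0):
        raise ValueError("contours must be sampled at the same time points")
    squared_errors = [(n - s) ** 2 for n, s in zip(natural_f0, synthetic_f0)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Invented F0 samples (Hz) at fixed time steps
natural = [120.0, 135.0, 150.0, 140.0, 125.0]
synthetic = [118.0, 138.0, 145.0, 141.0, 129.0]

print(round(contour_rmse(natural, synthetic), 2))  # prints 3.32
```

As the discussion above notes, a low RMSE at fixed time points does not guarantee that listeners perceive the two utterances as similar, since the comparison ignores where the perceptually "important" pitch events fall.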
These alternative measures were trialled alongside subjective user perceptions of differences between utterances. The trials showed that whilst the new measures were effective in identifying differences between the synthetic and natural utterances, they added nothing beyond the commonly used RMSE. These findings highlight the complexity of measurement in this area.

Another possible improvement to objective measures, suggested by Hirst et al (1998), is to use a corpus-based evaluation technique, comparing the synthetic utterance to a range of natural reference utterances rather than just one. The authors suggest that this would improve the reliability of the data.

In summary, objective measures can be a useful and very efficient means of evaluating synthetic voices. However, such techniques require specialist knowledge, and the measures themselves are still in development.

User testing

Evaluations of synthetic voices often involve testing the voice with those who are ultimately going to use it. Testing with users is beneficial for understanding how the voice will work in a particular application (Braga, Freitas and Barros, 2002). There are a variety of approaches to user testing, falling into two broad categories: performance measures and opinion measures. Performance measures (such as intelligibility) give information about whether the voice is sufficiently accurate for people to understand it; opinion measures (such as acceptability) give information on users' subjective judgements of the voice.

User testing in the evaluation of synthetic voices has various benefits. As a 'real world' test of the voice with those who will use it, such testing gives insight into whether the voice can be understood and whether it is accepted. It could be argued that this is all that matters when testing a voice: if end users can use and understand it, the technical accuracy of the voice may not be important.
Indeed, Campbell (2007) suggests that users are more interested in the 'performance' of the voice (such as how pleasant it is, how suitable it is for the context, and so on) than they are in the technical success of the developers. By testing with users, real world reactions to a voice can be gathered, which can be particularly informative when planning future products or services.

Of course, there are downsides to user testing too. It is time consuming and expensive to organise (Clark and Dusterhoff, 1999). Also, as individuals differ in both ability and opinion, it can be difficult to draw conclusions from diverse user data.

Performance measures

Performance measures test listeners' "reading performance" when reading with a synthetic voice. This is a way of testing how well the voice conveys information, or its intelligibility. There are a number of intelligibility tests, which differ on various factors. For example, some have closed answers (where listeners select what they heard from multiple choice options), and others have free answers (where listeners simply report what they think they heard). Furthermore, some tests measure intelligibility at phoneme level (testing whether listeners can tell the difference between sounds), whilst others test intelligibility at word level (evaluating listeners' ability to understand words). Testing with longer pieces of text (whole sentences) also allows evaluation of prosody: the nonverbal aspects of speech such as intonation, rhythm, pauses and so on (Benoît, Grice and Hazan, 1996).

To demonstrate some of these differences, two commonly used tests will be discussed in detail: the Diagnostic Rhyme Test (DRT) and the Semantically Unpredictable Sentences (SUS) test.

Diagnostic Rhyme Test (DRT)

The Diagnostic Rhyme Test is a closed answer test which tests intelligibility at phoneme level.
Listeners are presented with monosyllabic words which differ only in the first consonant, and have to choose which word they heard from pairs (for example, pin/fin, hit/bit). Braga et al (2002) discuss the complexity of constructing the test set of word pairs for the DRT. The test set should reflect common syllabic trends in the language, considering the likelihood of a consonant appearing at the beginning or the end of a word. Furthermore, pairs should be constructed so as to enable testing of the intelligibility of a variety of phonetic features. For the English version of the test, these features are voicing, nasality, sustension, sibilation, graveness and compactness (features relating to things like the vibration of the vocal cords, the position of the tongue and lips, and so on).

Results from the DRT can be evaluated in a variety of ways, from simply counting correct responses to analysing confusion between particular phonetic features. This is useful as it gives not only an overall impression of intelligibility but can also identify areas where confusions occur (Braga et al, 2002). Furthermore, the test can be carried out with a variety of voices and performance easily compared.

Semantically Unpredictable Sentences (SUS)

The Semantically Unpredictable Sentences (SUS) test is a free answer test evaluating intelligibility at word level. Listeners are presented with sentences which are syntactically normal but semantically abnormal. That is, the sentences use words of the right grammatical class, but the words do not make sense together (for example, "She ate the sky"). Listeners hear each sentence and are asked to write down what they think they heard. Using semantically unpredictable sentences allows evaluators to test the intelligibility of words without the context of the sentence cueing listeners to expect a particular word.
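As an illustration of the kind of DRT analysis described above, the sketch below scores a set of listener responses both overall and per phonetic feature. The word pairs, feature labels and responses are invented for illustration and do not come from a published test set.

```python
from collections import defaultdict

# Each trial: (word presented, word the listener chose, phonetic feature tested).
# Pairs and feature assignments are illustrative only.
trials = [
    ("pin", "pin", "voicing"),
    ("bin", "pin", "voicing"),      # confusion: listener heard "pin"
    ("mad", "mad", "nasality"),
    ("nod", "nod", "nasality"),
    ("sip", "ship", "sibilation"),  # confusion: listener heard "ship"
    ("zoo", "zoo", "sibilation"),
]

# Overall intelligibility: simple count of correct responses
correct = sum(1 for presented, chosen, _ in trials if presented == chosen)
print(f"Overall: {correct}/{len(trials)} correct")

# Per-feature breakdown highlights where confusions occur
by_feature = defaultdict(lambda: [0, 0])  # feature -> [correct, total]
for presented, chosen, feature in trials:
    by_feature[feature][1] += 1
    if presented == chosen:
        by_feature[feature][0] += 1

for feature, (right, total) in sorted(by_feature.items()):
    print(f"{feature}: {right}/{total}")
```

The per-feature tallies are what make the DRT diagnostic: in this invented data, voicing and sibilation each show a confusion, pointing to areas a developer might refine.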
Furthermore, it is possible to generate a huge number of these sentences, which reduces the need to re-use sentences (re-use could cause learning effects). Benoît et al (1996) discuss the process of constructing a test set for the SUS. They recommend using a variety of sentence structures. There are various rules for inclusion in each set, and a computer is used to randomly generate sentences based on frequency of word use. Benoît et al (1996) suggest the use of minisyllabic words (the shortest available words in their class) to further reduce contextual cues. They outline a procedure for running the SUS, suggesting that a consistent approach across tests will make it easier (and more accurate) to compare performance across different synthetic voices. Results from the test consist of the percentage of correctly identified sentences, both as a percentage of the full set and as a percentage for each sentence structure.

It must be noted that whilst the SUS is useful for comparing the intelligibility of different systems, setting up the procedure is complex and must be done carefully to ensure results are comparable (Benoît et al, 1996).

Opinion measures

User testing with synthetic voices is not just about how well the listener can understand the voice, but also about the listener's opinion of it. A widely used opinion measure in evaluations of synthetic voices is naturalness: the extent to which the voice sounds human. The most common test is the Mean Opinion Score (MOS). This test involves a large number of participants who listen to a set of sentences presented in synthetic speech and rate them on a simple 5 point scale (from 'excellent' to 'bad'). Scores are then averaged across the group (Francis and Nusbaum, 1999).

Generally speaking, more natural-sounding voices are more accepted (Stevens, Lees, Vonwiller and Burnham, 2005). However, research suggests that this may depend on context.
Francis and Nusbaum (1999) report on a study in which users preferred an unnatural sounding voice in the context of telephone banking, as they reasoned that they would rather a computer knew their bank balance than a real person.

Whilst naturalness is a widely used measure, some researchers suggest it is not ideal. Campbell (2007) suggests that 'believability' may be preferable, as in some cases a voice does not need to sound natural (for example, in cartoons). Other measures used include acceptability and likeability (Francis and Nusbaum, 1999).

Whilst many people may have preconceptions about synthetic speech which make them reluctant to use it, research suggests that people tend to get used to it. Hjelmquist, Jansson and Torrell (1990) found this to be the case with blind and partially sighted people reading synthetic daily newspapers. Whilst it took participants some time to become accustomed to the synthesised speech, over time they reported it to be of good quality. Indeed, research with performance measures also shows an improvement in understanding of synthetic voices following practice (Venkatagiri, 1994; Rhyne, 1982).

Feature comparisons

A common way in which people carry out evaluations is simply to compare products on a list of desired features or functionality. Whilst this is not a formal testing procedure discussed in the literature, it is acknowledged that all systems have their own strengths and weaknesses, and sometimes trade-offs need to be made depending on what is most important for a particular application (Campbell, 2007). Indeed, the common feature comparison test could also be described as an 'adequacy evaluation' (Furui, 2007). Furui describes an adequacy evaluation as "determining the fitness of a system for a purpose: does it meet the requirements, and if so, how well and at what cost?" (p.23). This method of evaluation may well be the most common form of evaluating synthetic voices.
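A feature comparison of this kind can be as simple as a checklist, matching required features against what each candidate voice offers. The feature names and voice capabilities in the sketch below are invented for illustration; a real evaluation would substitute the requirements of the application in question.

```python
# Invented requirements and candidate voice capabilities, for illustration only
required_features = {"UK English accent", "adjustable speaking rate", "screen reader support"}

candidate_voices = {
    "Voice A": {"UK English accent", "adjustable speaking rate"},
    "Voice B": {"UK English accent", "adjustable speaking rate", "screen reader support"},
}

for name, features in candidate_voices.items():
    missing = required_features - features
    if missing:
        print(f"{name}: missing {sorted(missing)}")
    else:
        print(f"{name}: meets all requirements")
```

Set difference makes the trade-offs explicit: any voice with a non-empty "missing" list fails the adequacy check for this particular application, even if it scores well on other measures.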
As such a test does not require technical expertise or a complex set-up, it is accessible to anyone working with synthetic voices. It would be beneficial to have a standardised procedure for carrying out such evaluations, to act as a guide.

Conclusion

This paper has outlined a range of different methods for evaluating synthetic voices. Different methods are likely to be used by different people, at different stages in the development of a synthetic voice product. Each method is useful in its own way, and it is important to consider the purpose of the evaluation when choosing which method to use. Some methods require technical expertise or expert knowledge, whereas others can be used by anyone.

This paper supports the development of a protocol for evaluating synthetic voices, which aims to outline a simple, easy to use evaluation procedure. The protocol combines a range of evaluation methods and allows those carrying out evaluations to pick and choose which aspects of the voice are important to their evaluation. The synthetic speech evaluation protocol is available on RNIB's website at http://www.rnib.org.uk/professionals/accessibleinformation/accessibleformats/audio/speech/Pages/synthetic_speech.aspx

References

Benoît, C., Grice, M., and Hazan, V. (1996). The SUS test: a method for the assessment of text-to-speech synthesis intelligibility using Semantically Unpredictable Sentences. Speech Communication, 18, 381-392.

Braga, D., Freitas, D., and Barros, M.J. (2002). A DRT approach for subjective evaluation of intelligibility in European Portuguese synthetic speech. International Conference on Systems Science (ICOSYS 2002), October 21-24, 2002, Rio de Janeiro, Brazil.

Campbell, N. (2007). Evaluation of speech synthesis. In L. Dybkjær, H. Hemsen and W. Minker (Eds.), Evaluation of Text and Speech Systems. Springer: The Netherlands, pp. 29-64.

Clark, R.A.J., and Dusterhoff, K.E. (1999). Objective methods for evaluating synthetic intonation. In Proceedings of Eurospeech 1999, Volume 4, pp. 1623-1626.

Francis, A.L., and Nusbaum, H.C. (1999). Evaluating the quality of synthetic speech. In D. Gardner-Bonneau (Ed.), Human Factors and Voice Interactive Systems. Boston: Kluwer Academic, pp. 63-97.

Furui, S. (2007). Speech and speaker recognition evaluation. In L. Dybkjær, H. Hemsen and W. Minker (Eds.), Evaluation of Text and Speech Systems. Springer: The Netherlands, pp. 1-27.

Hirst, D., Rilliard, A., and Aubergé, V. (1998). Comparison of subjective evaluation and an objective evaluation metric for prosody in text-to-speech synthesis. In ESCA Workshop on Speech Synthesis, Jenolan Caves, Australia, 1998.

Hjelmquist, E., Jansson, B., and Torrell, G. (1990). Computer-oriented technology for blind readers. Journal of Visual Impairment and Blindness, 17, 210-215.

Morton, K. (1991). Expectations for assessment techniques applied to speech synthesis. Proceedings of the Institute of Acoustics, vol. 13, part 2.

Rhyne, J.M. (1982). Comprehension of synthetic speech by blind children. Journal of Visual Impairment and Blindness, 10 (10), 313-316.

Stevens, C., Lees, N., Vonwiller, J., and Burnham, D. (2005). Online experimental methods to evaluate text-to-speech (TTS) synthesis: effects of voice gender and signal quality on intelligibility, naturalness and preference. Computer Speech and Language, 19, 129-146.

Venkatagiri, H.S. (1994). Effect of sentence length and exposure on the intelligibility of synthesized speech. Augmentative and Alternative Communication, 10, 96-104.