DRAFT Article Review: Spoken Dialogue Technology: Enabling the Conversational User Interface

MICHAEL F. MCTEAR, University of Ulster
ACM Computing Surveys (CSUR), Volume 34, Issue 1 (March 2002), Pages: 90-169
Year of Publication: 2002
ISSN: 0360-0300

This article has been reviewed by Gregory A. Vaughn Sr., in partial fulfillment of course DC891, Summer of 2004.

“Spoken dialogue systems allow users to interact with computer-based applications such as databases and expert systems by using natural spoken language. The origins of spoken dialogue systems can be traced back to Artificial Intelligence research in the 1950s concerned with developing conversational interfaces. However, it is only within the last decade or so, with major advances in speech technology, that large-scale working systems have been developed and, in some cases, introduced into commercial environments. As a result many major telecommunications and software companies have become aware of the potential for spoken dialogue technology to provide solutions in newly developing areas such as computer-telephony integration. Voice portals, which provide a speech-based interface between a telephone user and Web-based services, are the most recent application of spoken dialogue technology. This article describes the main components of the technology---speech recognition, language understanding, dialogue management, communication with an external source such as a database, language generation, speech synthesis, and shows how these component technologies can be integrated into a spoken dialogue system.
The article describes in detail the methods that have been adopted in some well-known dialogue systems, explores different system architectures, considers issues of specification, design, and evaluation, reviews some currently available dialogue development toolkits, and outlines prospects for future development.” [1]

Beyond his abstract, McTear presents this article as a detailed description and explanation of the most salient elements of spoken dialogue technology. His intended audience includes both computer scientists new to this technology and those experienced in it who wish to conduct further research in the area. In this paper McTear takes a different approach: rather than focusing on spoken language systems currently in existence, he focuses on the underlying technologies. His review is supplemented by examples of specific systems that illustrate common issues and problems. [1]

The review is broken down into logical units that cover:

Spoken Dialogue System Definition

Classification of Dialogue Systems by Control Strategies

Components of a Spoken Dialogue System
1. Speech recognition: The conversion of an input speech utterance, consisting of a sequence of acoustic-phonetic parameters, into a string of words.
2. Language understanding: The analysis of this string of words with the aim of producing a meaning representation for the recognized utterance that can be used by the dialogue management component.
3. Dialogue management: The control of the interaction between the system and the user, including the coordination of the other components of the system.
4. Communication with external system: For example, with a database system, expert system, or other computer application.
5. Response generation: The specification of the message to be output by the system.
6. Speech output: The use of text-to-speech synthesis or prerecorded speech to output the system’s message.

Review of a Number of Architectures and Dialogue Control Strategies
1. Speech Corpora
2. Wizard-of-Oz Studies
3. Speech and Language Understanding Components
4. Guidelines and Standards for Spoken Language Systems

Research Approach

The author has conducted a formal survey of the professional literature (a literature review) in support of his hypothesis that practical and efficient research is needed that fuses a number of the currently available technologies to produce practical solutions to problems in spoken dialogue technology. I believe the work is of importance to researchers in this area, and to me specifically, because it brings under one cover a cross section of the more important work on this topic. It therefore becomes an excellent jump-off point for any serious research on spoken language technology. It also gives cause to pause and rethink the approaches that can or should be taken, and the limits that may apply.

Related Research Problems and Future Research [1]

Research in robotics on gesture-free spoken dialogue has revealed that current systems cannot handle deictic references (references made with words such as here, there, and that). These types of references have no meaning unless they are tied to some concrete situation. Usually they are resolved by including physical gestures, such as pointing in the direction you wish the robot to go. A research question can be posed to see [3] “how far can one go in resolving deictic reference without gesture recognition?”

Another area for examination might be universal quantification and negation. Current systems cannot properly interpret and respond to questions such as “Are the cans that you see all red?” (asked when the robot sees a soda can). Likewise, instructions such as “Don’t do it” cannot be properly interpreted. Only simple phrases such as “stop” or “stop going to” can be used.
Lastly, while there are many instances where communication with a robot by a number of users may be desirable, only one user at a time can conduct a dialogue with current spoken language systems. Prioritizing natural language inputs and making sense of seemingly contradictory commands become major research challenges.

The author offers multiple paths for future research in the area of spoken language and dialogue technology. These research initiatives may focus more on the different ways that natural language components can be integrated and on how spoken dialogue systems can be deployed as marketable real-world applications. Research in AI, for instance planning, may be tightly integrated with speech act theory to produce models of conversational agency. In addition, more sophisticated dialogue managers may be developed, based on research in text-based dialogue managers.

Another path suggests that statistical techniques may be integrated with reinforcement learning algorithms. This integration would allow for “automated learning of optimal learning strategies”. Modeling dialogue as a Markov decision process builds on statistical techniques that are already popular for recognition and translation in text: each point in the dialogue may be viewed as an element in a state space affected by both user responses and system actions.

Yet another path to be traversed is the World Wide Web. The author believes that “there is a potential for applications using spoken dialogue technology to perform services such as home shopping, or to control program appliances around the home.” VoiceXML distributed applications are currently being developed and deployed that enable synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative dialogues. These techniques, now accepted as a standard, may be deployed in spoken language applications that have real-world usability.

In summary, the author has provided a list of potential areas for spoken dialogue research:
“more robust speech recognition”, including
   1. the ability to perform well in noisy conditions
   2. the ability to deal with out-of-vocabulary words
   3. close integration with technologies for natural language

the use of prosody in spoken dialogue systems to
   1. provide more natural-sounding output
   2. assist recognition by identifying phrase boundaries as well as the functions of utterances

research concerned with component integration and with investigating the extent to which the language understanding and dialogue management components can compensate for deficiencies in speech recognition

investigation of the applicability of different technologies for particular application types, such as the costs and benefits of parsing using theoretically motivated grammars compared with robust and partial parsing and with more pragmatically driven methods such as concept spotting

studies of different approaches to dialogue management in relation to the requirements of an application, indicating for example
   1. where state-based methods are applicable
   2. under which circumstances more complex approaches are required

the incorporation of more sophisticated approaches to dialogue management deriving from AI-based research

research into the use of stochastic and machine learning techniques

the development of multimodal dialogue systems

dialogue systems with Web integration

In closing, the author concludes that much of the future research will most probably be pursued by commercial interests in an effort to deploy products that are both marketable and profitable and that, within the constraints of the technology, can be made desirable to the consumer.

References

[1] Michael F. McTear (2002). Spoken dialogue technology: enabling the conversational user interface.
ACM Computing Surveys, Volume 34, Issue 1 (March 2002), pp. 90-169.
[2] D.C. Charles Hair, Review: Spoken dialogue technology: enabling the conversational user interface. ACM Computing Surveys (CSUR).
[3] Human-Robot Interaction Through Gesture-Free Spoken Dialogue. Autonomous Robots, Volume 16, Issue 3 (May 2004), pp. 239-257. Year of Publication: 2004. ISSN: 0929-5593.

Citation Appendix

Dimitris Spiliotopoulos, Ion Androutsopoulos, and Constantine D. Spyropoulos, Human-Robot Interaction Based on Spoken Natural Language Dialogue. Software and Knowledge Engineering Laboratory, Institute of Information and Telecommunications, National Centre for Scientific Research.

Luis Villaseñor, Manuel Montes, Jean Caelen, A model for the conversation multimodal manmachine: integration of speech and action, Proceedings of the Latin American conference on Human-computer interaction, August 17-20, 2003, Rio de Janeiro, Brazil.

Karen Ward, David G. Novick, Hands-free documentation, Proceedings of the 21st annual international conference on Documentation, October 12-15, 2003, San Francisco, CA, USA.

Spoken Language Technology 103

locate these connectors, instructions for the first two actions are not required and so the system proceeds with instructions for the third action, which is confirmed in User3, and for the fourth action. Here the user requires further instructions, which are given in System5 with the action confirmed in User5. At this point the user asserts that the wire between 84 and 99 is connecting, so that the fifth instruction to connect the second end to 99 is not required. A further missing axiom is discovered, which leads the system to ask what the LED is displaying (System7).

3.4. Summary

The examples presented in this section have illustrated three different types of dialogue control strategy.
The selection of a dialogue control strategy determines the degree of flexibility possible in the dialogue and places requirements on the technologies employed for processing the user’s input and for correcting errors. There are many variations on the dialogue strategies illustrated here, and these will be discussed in greater detail in Section 5. The next section will examine the component technologies of spoken dialogue systems.

4. COMPONENTS OF A SPOKEN DIALOGUE SYSTEM

A spoken dialogue system involves the integration of a number of components that typically provide the following functionalities [Wyard et al. 1996]:

Speech recognition: The conversion of an input speech utterance, consisting of a sequence of acoustic-phonetic parameters, into a string of words.

Language understanding: The analysis of this string of words with the aim of producing a meaning representation for the recognized utterance that can be used by the dialogue management component.

Dialogue management: The control of the interaction between the system and the user, including the coordination of the other components of the system.

Communication with external system: For example, with a database system, expert system, or other computer application.

Response generation: The specification of the message to be output by the system.

Speech output: The use of text-to-speech synthesis or prerecorded speech to output the system’s message.

These components are examined in the following subsections in relation to their role in a spoken dialogue system (for a recent text on speech and language processing, see Jurafsky and Martin [2000]).

4.1. Speech Recognition

The task of the speech recognition component of a spoken dialogue system is to convert the user’s input utterance, which consists of a continuous-time signal, into a sequence of discrete units such as phonemes (units of sound) or words. One major obstacle is the high degree of variability in the speech signal.
This variability arises from the following factors:

Linguistic variability: Effects on the speech signal caused by various linguistic phenomena. One example is coarticulation, that is, the fact that the same phoneme can have different acoustic realizations in different contexts, determined by the phonemes preceding and following the sound in question.

Speaker variability: Differences between speakers, attributable to physical factors such as the shape of the vocal tract as well as factors such as age, gender, and regional origin; and differences within speakers, due to the fact that even the same words spoken on a different occasion by the same speaker tend to differ in terms of their acoustic properties. Physical factors such as tiredness, congested airways due to a cold, and changes of mood have a bearing on how words are pronounced, but the location of a word within a sentence and the degree of emphasis it is given are also factors which result in intraspeaker variability.

ACM Computing Surveys, Vol. 34, No. 1, March 2002.
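
As an illustration of how the six components listed in Section 4 might fit together, the following is a minimal sketch, not taken from the article. Every name in it (Meaning, TIMETABLE, recognize, understand, query_external, manage_dialogue, generate_response, speak, run_turn) is hypothetical, and the train-timetable domain is invented for the example. Speech recognition and speech output are replaced by text stand-ins, so the sketch sidesteps the variability problems described above and shows only the flow of data between components.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Meaning:
    """Meaning representation handed from understanding to dialogue management."""
    intent: str
    slots: dict = field(default_factory=dict)


# Toy stand-in for the external system (assumed data, for illustration only).
TIMETABLE = {"belfast": "09:15", "dublin": "10:40"}


def recognize(utterance: str) -> str:
    """Speech recognition stand-in: the input is already text, so just normalize it."""
    return utterance.lower().strip(" .?!")


def understand(words: str) -> Meaning:
    """Language understanding by crude keyword spotting on the word string."""
    tokens = words.split()
    slots = {}
    # Treat the word following the last "to" as the destination.
    for prev, word in zip(tokens, tokens[1:]):
        if prev == "to":
            slots["destination"] = word
    intent = "find_train" if "train" in tokens else "unknown"
    return Meaning(intent, slots)


def query_external(destination: str) -> Optional[str]:
    """Communication with the external system: look up a departure time."""
    return TIMETABLE.get(destination)


def manage_dialogue(meaning: Meaning) -> dict:
    """Dialogue management: choose the next system act from the user's meaning."""
    if meaning.intent != "find_train":
        return {"act": "reject"}
    destination = meaning.slots.get("destination")
    if destination is None:
        return {"act": "ask_destination"}
    time = query_external(destination)
    if time is None:
        return {"act": "no_result", "destination": destination}
    return {"act": "inform", "destination": destination, "time": time}


def generate_response(act: dict) -> str:
    """Response generation: turn the system act into a message."""
    if act["act"] == "inform":
        return f"The next train to {act['destination'].title()} leaves at {act['time']}."
    if act["act"] == "ask_destination":
        return "Where would you like to travel to?"
    if act["act"] == "no_result":
        return f"Sorry, I have no trains to {act['destination'].title()}."
    return "Sorry, I did not understand that."


def speak(message: str) -> str:
    """Speech output stand-in: return the text a synthesizer would render."""
    return message


def run_turn(utterance: str) -> str:
    """One user turn passed through all six components in order."""
    return speak(generate_response(manage_dialogue(understand(recognize(utterance)))))


if __name__ == "__main__":
    print(run_turn("I want a train to Belfast"))
    print(run_turn("I need a train"))
```

Keeping each component behind its own function mirrors the separation of functionalities in the list above; in a real system each stand-in would be replaced by a far more capable module, and the dialogue manager would also keep state across turns.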