Conversational browser for accessing VoiceXML-based IVR services via multi-modal interactions on mobile devices

JIEUN PARK, JIEUN KIM, JUNSUK PARK, DONGWON HAN
Computer & Software Technology Lab.
Electronics and Telecommunications Research Institute
161 Kajong-Dong, Yusong-Gu, Taejon, 305-350
KOREA

Abstract: - Users can access VoiceXML-based IVR (Interactive Voice Response) services with mobile devices such as smart phones at any time and in any place. Even though these devices have small screens, users must interact with the services through a voice-only modality. This uni-modality causes some fundamental problems. 1) Users cannot know which service items can be selected before the TTS (Text-to-Speech) engine reads them. 2) Users always have to pay attention not to forget the items they can select and the item they will select. 3) Users cannot immediately confirm whether their speech input was valid, so they always have to wait for a new question from the server in order to confirm it. Because of this inconvenience and cumbersomeness, users generally prefer to connect to a human operator directly. In this paper, we propose a new conversational browser that lets a user access existing VoiceXML-based IVR (Interactive Voice Response) services via multi-modal interactions on a small screen. The conversational browser fetches voice-only web pages from web servers, converts them into multi-modal web pages described in a multi-modal markup language, and interprets the converted pages.

Key-Words: - VoiceXML, Multi-modal interaction, XHTML+Voice, WWW, Internet, Interactive voice response services, Mobile devices

1 Introduction
VoiceXML allows Web developers to use their existing Java, XML, and Web development skills to design and implement IVR (Interactive Voice Response) services; they no longer have to learn proprietary IVR programming languages[1]. Many companies have rewritten their proprietary IVR systems in VoiceXML. In general, voice-only web services have simpler flows and more specific domains than existing visual web services because they must rely on the voice-only modality.
Users can access these voice-only web services with mobile devices such as smart phones at any time and in any place. Even though the devices have small screens, users must interact with the services through a voice-only modality. This uni-modality causes some fundamental problems. 1) Users cannot know which service items can be selected before the TTS (Text-to-Speech) engine reads them. 2) Users always have to pay attention not to forget the items they can select and the item they will select. 3) Users cannot immediately confirm whether their speech input was valid, so they always have to wait for a new question from the server and reply to it in order to confirm. Because of this inconvenience and cumbersomeness, users generally prefer to connect to a human operator directly. In that case, the original purpose of IVR services, automation, may not be achieved.
In this paper, we propose a new conversational browser that lets a user access existing VoiceXML-based IVR services via multi-modal interactions on a small screen. Users can both see and hear the items they can select, and can confirm the result of a speech input through the displayed text instead of starting a new dialogue with the server for confirmation.
The organization of this paper is as follows.
In Section 2, we describe the concept of multi-modal access to VoiceXML-based IVR services. In Section 3, we describe the conversational browser architecture and its overall execution flow. In Sections 4 and 5, we describe related work and conclusions.

2 Multi-modal access to VoiceXML-based IVR services
2.1 VoiceXML-based IVR services
Fig. 1 illustrates the architectural model of VoiceXML-based IVR systems[1]. A conversation between a user and a system begins when a telephone call is initiated. Once a call is connected over the phone network, the VoiceXML Infrastructure acts as a "browser" and begins making a series of HTTP requests to a traditional web server for VoiceXML, audio, grammar, and ECMAScript documents. The web server responds with these simple documents over HTTP.

Fig. 1 The architectural model of VoiceXML-based IVR services (an end user connects over the phone network to the VoiceXML Infrastructure, which exchanges VoiceXML requests and responses with web servers and backend systems over an IP network)

Once the documents are retrieved, the VoiceXML "interpreter" within the VoiceXML Infrastructure executes the IVR application and engages in a conversation with the end user. All software and resources necessary to "execute" a particular IVR service, such as voice recognition, computer-generated text-to-speech, and ECMAScript execution, are embedded within the VoiceXML Infrastructure. The following is a simple IVR service example[5] for ordering pizzas via a conversation between a human and a computer.

Computer: How many pizzas would you like?
Human: One.
Computer: What size of pizza would you like? Say one of small, medium, or large.
Human: Medium.
Computer: Would you like extra cheese? Say one of yes or no.
Human: Yes.
Computer: What vegetable toppings would you like? Say one of Olives, Mushrooms, Onions, or Peppers.
Human: Um... help.
Computer: What vegetable toppings would you like? Say one of Olives, Mushrooms, Onions, or Peppers.
Human: Mushrooms.
Computer: What meat toppings would you like? Say one of Bacon, Chicken, Ham, Meatball, Sausage, or Pepperoni.
Human: Help.
Computer: Say one of Bacon, Chicken, Ham, Meatball, Sausage, or Pepperoni.
Human: Sausage.
Computer: Thank you for your order.

Example 1. An IVR service example

In the example above, users always have to answer the questions in an order pre-defined by the service provider and remember which items they can select. This voice-only interaction style is inefficient, because many users already access these services with mobile phones that have small screens. The small screen lets users both see and hear the information about which items can be selected, and it lets them select any item in any order, as long as the selected item has no dependency on the others. In the following section, we describe how the service of Example 1 can be accessed via multi-modal user interactions.
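For concreteness, the first turns of this dialogue would be written in VoiceXML roughly as sketched below, in the same style as Example 2 in Section 3.3. This is a minimal sketch: the field names, grammar rules, and submit URL are illustrative assumptions, not the source of an actual pizza-ordering service.

<?xml version="1.0"?>
<!-- Hypothetical VoiceXML sketch of the first turns of Example 1;
     field names, grammar items, and the submit URL are assumptions. -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="pizzaOrder">
    <field name="quantity">
      <prompt>How many pizzas would you like?</prompt>
      <grammar type="application/srgs+xml" root="qty" version="1.0">
        <rule id="qty" scope="public">
          <one-of>
            <item>one</item> <item>two</item> <item>three</item>
          </one-of>
        </rule>
      </grammar>
    </field>
    <field name="size">
      <prompt>What size of pizza would you like?
              Say one of small, medium, or large.</prompt>
      <grammar type="application/srgs+xml" root="size" version="1.0">
        <rule id="size" scope="public">
          <one-of>
            <item>small</item> <item>medium</item> <item>large</item>
          </one-of>
        </rule>
      </grammar>
    </field>
    <block>
      <!-- topping fields omitted; the completed form is submitted here -->
      <submit next="http://pizza.example.com/order.asp"/>
    </block>
  </form>
</vxml>

The VoiceXML interpreter visits the fields in document order, which is why the user must answer the questions in the order fixed by the service provider.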
2.2 The access to IVR services via multi-modal user interactions
Fig. 2 shows the service of Example 1 accessed via multi-modal user interactions. If the user clicks the textbox below the label "Quantity", he hears "How many pizzas would you like?", the first dialogue turn of Example 1. At this point, the user can reply via voice or text input; if the user says "one", the textbox shows the character "1". If the user clicks the label "Size", he hears "What size of pizza would you like?", the second dialogue turn of Example 1. At this point, the user can select one of the radio buttons or say one of "small", "medium", or "large"; if the user says "small", the radio button "small 12" is selected. After answering all of these questions, the user clicks the "Submit Pizza Order" button to send the data to the server.

Fig. 2 A multi-modal web page (clicking the "Quantity" textbox plays "How many pizzas would you like?"; clicking the size, topping, vegetable-topping, and meat-topping labels plays the corresponding questions)

By supplementing the voice modality with a visual modality, users can know in advance which items can be selected and can choose their favorite modality according to the circumstances. Users also do not need to answer additional questions to validate their speech input, because they can see the recognition result immediately in the displayed text.

Previously, there has been much research in the field of multi-modal browsers[6, 7, 8, 9]. That research focuses on adding another modality (mainly voice) to existing visual browsers. We think, however, that the effect of adding a voice modality to visual-only web applications is less powerful than the effect of adding a visual modality to voice-only web applications, that is, VoiceXML-based IVR services. For visual-only applications, whether or not a voice modality is supported is not an indispensable issue. For voice-only applications, however, whether or not a visual modality is supported is a very critical issue for users who are already accustomed to existing visual environments.

We use the XHTML+Voice (X+V) markup language[4], proposed by IBM and Opera Software to the W3C, to describe the multi-modality. X+V extends XHTML Basic with a subset of VoiceXML 2.0[10], XML-Events, and a small extension module. In X+V, the modularized VoiceXML does not include the "non-local transfer" elements such as "exit", "goto", "link", "script", and "submit", the "menu" elements such as "menu", "choice", "enumerate", and "object", the "root" elements such as "vxml" and "meta", or the "telephone" elements such as "transfer" and "disconnect". The small extension module includes the important "sync" and "cancel" elements. The "sync" element supports synchronization of data entered via either speech or visual input; in Fig. 2, a "Quantity" value entered via speech is displayed in the textbox below the label "Quantity" by means of a "sync" element. The "cancel" element allows a user to stop a running speech dialogue when he does not want the voice interaction.
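To make the roles of "sync" and "cancel" concrete, the fragment below sketches how the "Quantity" field of Fig. 2 could be wired up in X+V. The element ids, namespace prefixes, and event attributes reflect our reading of the X+V profile and are assumptions for illustration only, not code taken from the application.

<!-- Hypothetical X+V fragment for the "Quantity" field of Fig. 2 -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
  <head>
    <vxml:form id="askQuantity">
      <vxml:field name="quantity">
        <vxml:prompt>How many pizzas would you like?</vxml:prompt>
        <!-- grammar omitted for brevity -->
      </vxml:field>
    </vxml:form>
    <!-- "sync": mirror the speech field into the visual textbox -->
    <xv:sync xv:input="quantityBox" xv:field="#quantity"/>
  </head>
  <body>
    <label for="quantityBox">Quantity</label>
    <!-- focusing (clicking) the textbox starts the voice dialogue -->
    <input type="text" id="quantityBox" name="quantityBox"
           ev:event="focus" ev:handler="#askQuantity"/>
    <!-- a "cancel" handler would stop the running dialogue when invoked -->
  </body>
</html>

With wiring of this kind, a spoken reply fills the textbox and a typed value is available to the voice form, which is the behavior described above for the "Quantity" field in Fig. 2.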
The structure of an XHTML+Voice application is shown in Fig. 3[5].

Fig. 3 The components of an XHTML+Voice application (Namespace Declaration, Visual Part (XHTML), Voice Part (VXML), Event Type, Event Handler, Processing Part)

A basic XHTML+Voice multi-modal application consists of a Namespace Declaration, a Visual Part, a Voice Part, and a Processing Part. The Namespace Declaration of a typical XHTML+Voice application is written in XHTML, with additional declarations for VoiceXML and XML-Events. The Visual Part of an XHTML+Voice application is the XHTML code that displays the various form elements on the device's screen, if available; this can be ordinary XHTML code and may include check boxes and other form items found in a typical form. The Voice Part of an application is the section of code that prompts the user for a desired field within a form. This VoiceXML code uses an external grammar to define the possible field choices; if there are many choices, or a combination of choices is required, the external grammar can handle the valid combinations. The Processing Part of the application contains the code that performs the needed instructions for each of the various events[5].

3 The Conversational Browser
3.1 Conceptual Model
The conversational browser transforms voice-only web pages into multi-modal web pages that include visual as well as voice elements in order to support multi-modal user interactions. By using our conversational browser, a mobile user with a small screen can access existing VoiceXML-based IVR applications via voice as well as visual interactions. The effect of this supplement, adding visual interactions to voice-only applications, is more convenient for users than the effect of the reverse case, adding voice interactions to visual-only applications. Fig. 4 describes the conceptual model of our conversational browser.

Fig. 4 The conceptual model of the conversational browser (a VoiceXML page is converted into an XHTML+Voice page consisting of XHTML, VoiceXML, Event, and Namespace parts)

The conversational browser fetches the VoiceXML pages that the user wants to access from web servers and analyzes which elements in the VoiceXML pages can be visualized. It then converts the original VoiceXML pages into XHTML+Voice pages with the same scenario. The conversion process is divided into four parts: a VoiceXML part, an XHTML part, an Event part, and a Namespace part. In the VoiceXML part, the conversational browser transforms the elements of the original VoiceXML pages into the modularized VoiceXML elements allowed in XHTML+Voice. In the XHTML part, the conversational browser adds new XHTML elements for visualizing some VoiceXML elements. For example, a "prompt" element in VoiceXML gives the user information by speaking it through the TTS engine; this "prompt" element can be changed to a "label" element in XHTML+Voice and shown as a text string on the screen. A "field" element in VoiceXML is an input item to be gathered from the user; it can be changed to a text input element in XHTML+Voice and shown as a textbox on the screen. The Event part combines visual and voice elements in order to synchronize inputs generated from the different modalities. The Namespace part makes the variables defined in the original VoiceXML pages usable in the newly generated XHTML elements. The result of the conversion is XHTML+Voice pages that include voice as well as visual elements. Finally, the conversational browser executes the XHTML+Voice pages.
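As a small illustration of the XHTML part of this mapping, a prompt/field pair from a voice-only page could be given the following visual counterparts. The prompt text and element names are hypothetical; only the prompt-to-label and field-to-textbox rules come from the description above.

<!-- voice-only source fragment (hypothetical) -->
<field name="quantity">
  <prompt>How many pizzas would you like?</prompt>
</field>

<!-- visual elements added by the converter: the prompt becomes a label,
     the field becomes a text input shown as a textbox on the screen -->
<label for="quantity">How many pizzas would you like?</label>
<input type="text" id="quantity" name="quantity"/>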
3.2 Architecture
Fig. 5 describes the architecture of the conversational browser. The conversational browser consists of three execution modules: a VoiceXML Parser, a VoiceXML-to-X+V Converter, and an XHTML+Voice Interpreter. The conversational browser also needs external systems for voice interactions, namely a text-to-speech (TTS) engine and a speech recognizer, and a JavaScript engine for executing scripts. In the case of mobile devices, the TTS engine and the speech recognizer have to be located on other platforms.

Fig. 5 The architecture of the conversational browser (the VoiceXML Parser produces a VoiceXML DOM tree; the VoiceXML-to-X+V Converter produces a modularized VoiceXML DOM tree and a created XHTML+Event DOM tree; the XHTML+Voice Interpreter, composed of a VoiceXML Form Interpreter, an X+V Viewer, and an Event Manager, keeps the two trees in sync and uses the TTS engine/speech recognizer and the JavaScript engine)

3.2.1 VoiceXML Parser
The VoiceXML Parser generates a DOM tree by parsing an input VoiceXML page. The generated DOM tree is transmitted to the VoiceXML-to-X+V Converter, which transforms the voice-only application into a multi-modal application.

3.2.2 VoiceXML-to-X+V Converter
The VoiceXML-to-X+V Converter creates an XHTML+Event DOM tree by referencing the visualizable elements of the VoiceXML DOM tree, and deletes or edits some elements in the original VoiceXML DOM tree. Fig. 6 roughly shows the converter's execution flow. First, the converter creates a new XHTML+Event DOM tree that includes only a head element and a body element. Then the converter performs the following steps until all elements in the VoiceXML DOM tree have been visited.
For a "block" element, which contains executable content, there are two cases: one contains PCDATA and the other contains a "submit" element. PCDATA originally means content that the TTS engine should read, so the converter adds a "P" element to the created DOM tree to visualize it as text. A "submit" means that values are to be submitted to a specific server; the converter adds an "input" node to the created tree and deletes the submit node from the original VoiceXML tree. The reason for deleting it is that the host language of the multi-modal description is not VoiceXML but XHTML, and the same "submit" function is defined in XHTML.

Fig. 6 The VoiceXML-to-X+V converter's execution flow (flowchart of the per-node conversion rules described in this subsection)

A "field" element specifies an input item to be gathered from the user, so the converter has to visualize this item to allow multi-modal input. The converter adds a new form node to the generated DOM tree and a new "label" node to indicate what input the application wants. The converter also adds a new "input" node to the created DOM tree for gathering data from the user and connects this visual element with the field element in VoiceXML. The connection between a visual element and a voice element is needed to synchronize both modalities: if the user replies by speaking, the recognition result has to be shown in the text field, and in the reverse case the typed input has to be transmitted to the VoiceXML Form Interpreter.
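The "block"/"submit" case described at the beginning of this subsection can be pictured with the following hypothetical fragments: the submit node is removed from the VoiceXML side, and an equivalent XHTML submit control carries the same function in the host language. The URL and button label are assumptions.

<!-- original voice-only fragment (hypothetical) -->
<block>
  <submit next="http://www.example.com/order.asp"/>
</block>

<!-- converted visual counterpart: the VoiceXML submit node is deleted and
     an XHTML form with a submit input takes over submitting the values -->
<form action="http://www.example.com/order.asp" method="get">
  <input type="submit" value="Submit Order"/>
</form>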
For a "menu" element, the converter adds a new "label" node to the created DOM tree to indicate what the menu means, and for each "choice" element it adds the necessary number of link nodes to the created DOM tree. For a "grammar" element, the converter adds a new "input" node to the created DOM tree so that the user can select the desired item, and defines an event and its handler.

3.2.3 XHTML+Voice Interpreter
The XHTML+Voice (X+V) Interpreter consists of three parts: a VoiceXML Form Interpreter, an X+V Viewer, and an Event Manager. The VoiceXML Form Interpreter executes the modularized VoiceXML DOM tree and calls a TTS engine or a speech recognizer that is distributed in the networked environment. The X+V Viewer shows the visual items on the screen and transmits the user's input events to the Event Manager. The Event Manager calls the handler of the user's input event (focus, click, etc.) and synchronizes data entered via either speech or visual input. The X+V Interpreter also calls the JavaScript Engine when scripts are included in the VoiceXML pages.

3.3 An Example
This section describes the processing steps of the conversational browser using a simple example. If a user accesses the VoiceXML-based service of Example 2 on his mobile device, he hears the system say "Would you like coffee, tea, milk, or nothing?" and then answers with one of the items.

<vxml xmlns="...">
  <form>
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar type="application/srgs+xml" root="r2" version="1.0">
        <rule id="r2" scope="public">
          <one-of>
            <item>coffee</item>
            <item>tea</item>
            <item>milk</item>
            <item>nothing</item>
          </one-of>
        </rule>
      </grammar>
    </field>
    <block>
      <submit next="http://www.drink.example.com/drink2.asp"/>
    </block>
  </form>
</vxml>

Example 2. A simple VoiceXML file

The processing steps for this service in the conversational browser are as follows. First, the VoiceXML Parser generates a VoiceXML DOM tree by parsing Example 2, as shown in Fig. 7-a. Then the VoiceXML-to-X+V Converter takes the DOM tree of Fig. 7-a as input, changes it into the modularized DOM tree of Fig. 7-b, and generates the new XHTML+Event DOM tree of Fig. 7-c. In this case, the field "drink" in Fig. 7-b is synchronized with the radio-button inputs in Fig. 7-c.

Fig. 7 Converting VoiceXML into XHTML+Voice (a: VoiceXML DOM tree, b: modularized VoiceXML DOM tree, c: generated XHTML+Event DOM tree)

The XHTML+Voice Interpreter executes the two DOM trees of Fig. 7-b and 7-c as shown in Fig. 8.

Fig. 8 The resulting XHTML+Voice web page (clicking the label "Would you like..." starts the voice dialogue)

If the user clicks the label "Would you like...", he hears the TTS engine say "Would you like coffee, tea, milk, or nothing?". At this point, the user may either speak or click whatever he likes. If the user clicks the "Submit" button, the conversational browser sends the input value to the server. If the user clicks the "Cancel" button, the VoiceXML interpreter is stopped; this "Cancel" button is used when the user does not want the voice modality.
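For reference, the XHTML+Voice page that would result from converting Example 2 might look roughly like the sketch below, matching Fig. 7-c and Fig. 8: the field "drink" synchronized with radio buttons, plus Submit and Cancel controls. The namespace prefixes, ids, and event wiring reflect our reading of the X+V profile and are assumptions, not the browser's literal output.

<!-- Hypothetical XHTML+Voice result of converting Example 2 -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
  <head>
    <vxml:form id="askDrink">
      <vxml:field name="drink">
        <vxml:prompt>Would you like coffee, tea, milk, or nothing?</vxml:prompt>
        <!-- grammar carried over from Example 2 -->
      </vxml:field>
    </vxml:form>
    <!-- keep the speech field and the radio group in step -->
    <xv:sync xv:input="drink" xv:field="#drink"/>
  </head>
  <body>
    <form action="http://www.drink.example.com/drink2.asp">
      <!-- clicking the label starts the voice dialogue (Fig. 8) -->
      <label ev:event="click" ev:handler="#askDrink">
        Would you like coffee, tea, milk, or nothing?
      </label>
      <input type="radio" name="drink" value="coffee"/> coffee
      <input type="radio" name="drink" value="tea"/> tea
      <input type="radio" name="drink" value="milk"/> milk
      <input type="radio" name="drink" value="nothing"/> nothing
      <input type="submit" value="Submit"/>
      <input type="button" value="Cancel"/>
    </form>
  </body>
</html>

Speaking "coffee" or clicking the corresponding radio button would set the same value, which is then sent to the server when "Submit" is clicked.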
4 Related Works
Our research is related to two separate domains: automatically converting markup languages for web-based applications, and multi-modal web browsers. Until recently, research on converting markup languages has mainly focused on HTML to VoiceXML. IBM developed a commercial product, WebSphere Transcoding Publisher, that includes an HTML-to-VoiceXML transcoder[11]. Frankie James proposed a framework for developing a set of guidelines for designing audio interfaces to HTML, called the Auditory HTML access system[12]. Stuart Goose et al. proposed an approach for converting HTML to VoxML, a language very similar to VoiceXML[3]. Gopal Gupta et al. proposed an approach for translating HTML to VoiceXML based on denotational semantics and logic programming[13]. These works considered only uni-modality, visual-only or voice-only; their aim is to reuse the abundant HTML web contents through automatic conversion mechanisms.
Recently, research on multi-modal web browsers has been based on Speech Application Language Tags (SALT)[14] or XHTML+Voice[4]. SALT tags are added to an HTML document so that users with a special browser can interact with the Web using graphics and voice at the same time. In XHTML+Voice, existing VoiceXML tags are integrated into XHTML. In this paper, we use the XHTML+Voice markup language to describe multi-modal interactions, because our goal is to access existing VoiceXML-based IVR applications via multi-modal interactions.

5 Conclusions
Even though many people use mobile devices with small screens, they have so far accessed VoiceXML-based IVR applications only via the voice modality. With this uni-modality, particularly voice-only modality, most users tend to connect to a human operator directly and skip the interaction with the IVR system, because the existing IVR systems burden them with inconveniences such as having to remember which items can be selected or repeatedly confirming whether a speech input was valid. To resolve these problems on mobile devices, we proposed a new conversational browser that lets the user access existing VoiceXML-based IVR applications via multi-modal interactions. The conversational browser fetches the existing VoiceXML-based IVR applications and converts them into multi-modal applications based on XHTML+Voice. By using this conversational browser, users can select which modality to use according to their circumstances, can use visual and voice modalities at the same time, and can know in advance which items can be selected without waiting for the TTS engine to read them. Until recently, much research on multi-modality has focused on adding a voice modality to visual web applications. But the effect of adding a voice modality is less powerful than the effect of the reverse case, adding a visual modality to voice-only IVR applications.

References:
[1] Chetan Sharma and Jeff Kunins, VoiceXML: Strategies and Techniques for Effective Voice Application Development with VoiceXML 2.0, John Wiley & Sons, Inc., 2002.
[2] Zhuyan Shao, Robert Capra and Manuel A. Perez-Quinones, "Transcoding HTML to VoiceXML Using Annotations," Proceedings of ICTAI 2003.
[3] Goose S., Newman M., Schmidt C. and Hue L., "Enhancing web accessibility via the Vox Portal and a web-hosted dynamic HTML to VoxML converter," WWW9/Computer Networks, 33(1-6):583-592, 2000.
[4] Chris Cross, Jonny Axelsson, Gerald McCobb, T.V. Raman, and Les Wilson, "XHTML+Voice Profile 1.1," http://www-3.ibm.com.
[5] IBM Pervasive Computing, "Developing Multimodal Applications using XHTML+Voice," January 2003.
[6] X. Huang et al., "MiPAD: a multimodal interaction prototype," International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 7-11, May 2001.
[7] Georg Niklfeld et al., "Multimodal Interface Architecture for Mobile Data Services," Proceedings of the TCMC2001 Workshop on Wearable Computing, Graz, 2001.
[8] Zouheir Trabelsi et al., "Commerce and Businesses: A voice and ink XML multimodal architecture for mobile e-commerce systems," Proceedings of the 2nd International Workshop on Mobile Commerce, September 2002.
[9] Alpana Tiwari et al., "Conversational Multi-modal Browser: An Integrated Multi-modal Browser and Dialog Manager," 2003 Symposium on Applications and the Internet, Jan. 2003, pp. 27-31.
[10] Scott McGlashan et al., "Voice Extensible Markup Language (VoiceXML) Version 2.0," http://www.w3c.org/TR/2003.
[11] Nichelle Hopson, "WebSphere Transcoding Publisher: HTML-to-VoiceXML Transcoder," http://www7b.boulder.ibm.com/wsdd/library/techarticles/0201_hopson/0201_hopson.html, January 2002.
[12] F. James, "Presenting HTML Structure in Audio: User Satisfaction with Audio Hypertext," Proceedings of the International Conference on Auditory Display, pp. 97-103, November 1997.
[13] G. Gupta, O. El Khatib, M. F. Noamany and H. Guo, "Building the Tower of Babel: Converting XML Documents to VoiceXML for Accessibility," Proceedings of the 7th International Conference on Computers Helping People with Special Needs, pp. 267-272.
[14] Speech Application Language Tags (SALT) 1.0 Specification, http://www.saltforum.org/devforum/spec/SALT.1.0.a.asp, July 2002.