XML AND INTERFACE AGENTS
R. Gabrielle Reed
DIS - EDF 5906
December 11, 2002

Table of Contents

Introduction to XML
Is There an XML Agents Markup Language?
MS Agent
XML Graphic Tools
Flash and SWF
Scalable Vector Graphics (SVG)
Synchronized Multimedia Integration Language (SMIL)
MPEG-4
Virtual Reality Modeling Language (VRML)
Face Modeling Language (FML)
Virtual Human Markup Language (VHML)
Speech Recognition
Voice Integration
Behavioral Expression Animation Toolkit (BEAT)
Agent Control
Agent Definition Format

Introduction to XML

XML is a markup language that records the characteristics, attributes and functional aspects of an object so that they can easily be read, modified and processed. It requires less bandwidth than other graphical interface formats. Typically, a parser reads the XML and a transformer converts the information into a viewable format. Some formats are viewed directly in a browser, e.g. Hypertext Markup Language (HTML), Graphic Interchange Format (GIF) or Joint Photographic Experts Group (JPEG) images. Others require a viewer (plug-in), as is the case for many graphics formats such as Flash in Small Web Format (SWF), Synchronized Multimedia Integration Language (SMIL), Scalable Vector Graphics (SVG), and Virtual Reality Modeling Language (VRML). Systems and languages are also based on the Moving Picture Experts Group 4 (MPEG-4) standard, which covers graphics, audio, video, and the delivery and control of animation. Some XML languages require a framework to assemble the parts and function within a server, such as the Virtual Human Markup Language (VHML), which integrates dynamic animation with speech functions. Both MPEG-4 and VHML have extensive components and play different roles in generating the interface presence; some systems integrate both.

Is There an XML Agents Markup Language?

XML agents typically fall into two types. One type is a program that runs behind the scenes and communicates with services and other agents to meet a particular goal.
The other kind is an agent with a personality, in some cases an avatar (sometimes expanded as Advanced Video Attribute Terminal Assembler and Recreator, AVATAR); it is visible, communicates, and functions as an interface to an application. This paper addresses the interface agent.

The anatomy (attributes and functionality) of an interface agent includes:
1. Agent controls
2. Visual appearance
3. Behavior, such as animation including gestures and affect
4. Voice synthesis (text-to-speech)
5. Balloon text (text representation)
6. Speech recognition

Table 1: XML languages used for particular functions, compared with the Microsoft® (MS) Agent components.

Function | MS Agent Component | XML Equivalent
Agent Control | Agent Engine | XML parsers, transformers, frameworks, architectures with databases (DB) and knowledge bases (KB); Agent Definition Format (ADF)
Dialog Construction | Scripted statements in web scripting language | Dialog Manager Markup Language (DMML), Artificial Intelligence Markup Language (AIML)
Image Construction and Animation | Constructed graphic images | Synchronized Multimedia Integration Language (SMIL), Scalable Vector Graphics (SVG), Virtual Reality Modeling Language (VRML)
Balloon Text | Agent Character Editor | Speech Markup Language (SML)
Face Animation | Constructed graphic images by storyboard | Face Markup Language (FML), Face Animation Markup Language (FAML)
Voice | Lernout & Hauspie® (L&H) TruVoice Text-To-Speech Engine | Voice Markup Language (VoiceML), Voice control Markup Language (VoxML), Speech Markup Language (SML or Sable)
Human Animation, Gestures, Body | Constructed graphic images by storyboard | Human Markup Language (HML), Gesture Markup Language (GML), Body Animation Markup Language (BAML)
Human Animation and Voice Integration (Agent Construction) | Agent Character Editor | Moving Picture Experts Group 4 (MPEG-4), Virtual Human Markup Language (VHML)
Speech Recognition | Microsoft® Speech Recognition Engine | Speech Recognition Grammar Markup (SRGM), Talk Markup Language (TalkML), JSpeech Grammar Format (JSGF)

The use of XML in generating the interface agent can be applied in a number of ways:
1. Agent Control: to invoke the agent engine or text-to-speech engines and to issue animation or speech commands
2. Dialog Construction: to pace and control the timing of the conversation, to play facial expressions, and to facilitate the scripting and timing of dialogs
3. Image Construction and Animation: to include dynamic or static images for animation generation
4. Face Animation: to define the wire frame, surface, texture map, and motion of critical points on the face
5. Voice: to generate the speech and display text such as the "balloon text"
6. Human Animation: to define the body, face motions, gestures, "stories," and emotion
7. Human Animation and Voice Integration (Agent Construction): to time the delivery of text, specify the form of responses and replies, and integrate the animations and voice

MS Agent

A good example of an interface agent is MS Agent, an animation with lip-sync capabilities. Speech synthesis and recognition are controlled through the L&H speech engine. Behavior (animation) is controlled by the agent engine using the MS character file, which also defines the balloon text. Voice as recorded wave files or as synthesized text uses two different processes. Besides the core component applications, the following artifacts are needed to assemble an MS Agent:
1. Images of the animation steps
2. Text scripts to be spoken (text-to-speech) or recorded scripts
3. Images of the seven mouth shapes needed for lip-syncing

Using the MS Agent character editor, behaviors are made by inserting images into functions defined as animations. For lip-syncing, the mouth shapes are stored by phoneme as "overlays." Generating the agent produces the ".ACS" character file, which is stored in the default "C:/Windows/MSAgent/Char/" directory.
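The phoneme-to-overlay lookup just described can be sketched in a few lines. This is an illustrative Python sketch, not the MS Agent API: the phoneme grouping and image file names are assumptions made up for the example.

```python
# Illustrative sketch of phoneme-driven lip-syncing (not the actual MS Agent
# engine): each phoneme selects one of the stored mouth-shape "overlays".
# The phoneme grouping and image names below are assumptions for illustration.

MOUTH_OVERLAYS = {
    "closed":  "mouth_closed.gif",
    "wide":    "mouth_wide.gif",
    "narrow":  "mouth_narrow.gif",
    "rounded": "mouth_rounded.gif",
    "teeth":   "mouth_teeth.gif",
    "tongue":  "mouth_tongue.gif",
    "pressed": "mouth_pressed.gif",
}  # seven shapes, matching the seven mouth images the character editor expects

PHONEME_TO_SHAPE = {
    "aa": "wide", "ae": "wide", "iy": "narrow", "ow": "rounded",
    "uw": "rounded", "f": "teeth", "v": "teeth", "th": "tongue",
    "m": "pressed", "b": "pressed", "p": "pressed",
}

def lip_sync_frames(phonemes):
    """Return the overlay image to show for each phoneme in order,
    falling back to the closed mouth for unmapped phonemes or silence."""
    return [MOUTH_OVERLAYS[PHONEME_TO_SHAPE.get(p, "closed")] for p in phonemes]
```

The engine would play these frames in time with the synthesized audio; the table lookup is the whole trick.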
The agent, when used on a web page, is manipulated by an agent engine, which is loaded and remains available for requests until it is unloaded. Animations are played by sending requests to the engine, typically from VBScript or JavaScript, to play an animation or to speak a phrase. The agent engine sends requests to the speech synthesizer to speak a particular text, and it also processes the speak request to animate the lip-sync sequence according to the words in the text.

Reference:
Microsoft Agent: http://www.microsoft.com/msagent/

XML Graphic Tools

Graphic tools are used for animation. These include languages such as Synchronized Multimedia Integration Language (SMIL) and Scalable Vector Graphics (SVG). The categories of animated constructions currently available are puppets, talking heads and virtual actors (avatars). Some can also be constructed using Virtual Reality Modeling Language (VRML), Java3D and X3D.

Reference:
Graphics links: http://graphics.stanford.edu/~bregler/anim_links/

Flash and SWF

Some animation applications use Small Web Format (SWF), a movie file format from Macromedia. SWF is used in Macromedia Flash to deliver graphics, animation and sound over the Internet, and it has become the medium of choice for many publishers of multimedia. Flash has the ability to:
Create and mask vector art
Implement transition effects within a movie clip
Incorporate dynamic text
Create buttons and add navigation
Implement stream and event sounds

According to Macromedia, Flash has a small player, which gives it a wider distribution; the player is included in every Netscape download.
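SWF files themselves begin with a small fixed header that tools can inspect. As a sketch, the following Python function reads that prefix: a three-byte signature ("FWS" for an uncompressed file, "CWS" for one with a zlib-compressed body), a one-byte format version, and a little-endian 32-bit uncompressed file length.

```python
import struct

def read_swf_header(data: bytes) -> dict:
    """Parse the fixed 8-byte SWF prefix: a 3-byte signature ('FWS' for an
    uncompressed file, 'CWS' for a zlib-compressed body), a 1-byte format
    version, and a little-endian uint32 uncompressed file length."""
    if len(data) < 8:
        raise ValueError("too short to be an SWF file")
    signature = data[:3].decode("ascii", errors="replace")
    if signature not in ("FWS", "CWS"):
        raise ValueError("not an SWF signature: %r" % signature)
    version = data[3]
    (file_length,) = struct.unpack_from("<I", data, 4)  # uncompressed length
    return {"signature": signature, "version": version, "length": file_length}
```

For a real movie one would pass the first eight bytes of the .swf file to this function.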
References:
Open SWF file information: http://www.openswf.org/
Macromedia Flash Developer Tutorial on "Pal2": C:\Program Files\Macromedia\Flash MX\Help\Flash\ContextHelp_tut1.htm
http://www.macromedia.com/support/general/ts/documents/sw_flash_differences.htm

Scalable Vector Graphics (SVG)

SVG is a language for describing two-dimensional graphics in XML. SVG allows for three types of graphic objects: vector graphic shapes (e.g., paths consisting of straight lines and curves), images and text. Graphical objects can be grouped, styled, transformed and composited into previously rendered objects. The feature set includes nested transformations, clipping paths, alpha masks, filter effects and template objects. SVG drawings can be interactive, with mouse-over and on-click events, and dynamic, by including timing and transformations. Animations can be defined and triggered either declaratively (i.e., by embedding SVG animation elements in SVG content) or via scripting. Animations are constructed from images, basic shapes, canvases and fills. Adobe Illustrator 10 can create the images and save them as SVG. Animations can be time-sliced; for example, an animation may be represented by an object moving along a path during each time period.

Batik is the open-source SVG toolkit from Apache.org. The static SVG tags are supported, while the animation and dynamic tags are still being developed; as of August 29, 2002, interactivity is limited to text and keyboard events, and descriptions for motion are undergoing testing. To see the Batik demos, one needs to install the Java Runtime Environment.
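Because SVG is XML, an animated drawing can be assembled programmatically. The sketch below, assuming only Python's standard xml.etree module, builds a minimal SVG document in which an embedded, declarative animation element slides a circle across the canvas (the "embedding SVG animation elements in SVG content" style described above).

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # serialize SVG tags without a prefix

def animated_circle_svg() -> str:
    """Build a minimal SVG document whose circle slides left to right:
    the embedded <animate> element declaratively changes 'cx' over 3 s."""
    svg = ET.Element(f"{{{SVG_NS}}}svg", width="200", height="100")
    circle = ET.SubElement(svg, f"{{{SVG_NS}}}circle",
                           cx="20", cy="50", r="15", fill="navy")
    ET.SubElement(circle, f"{{{SVG_NS}}}animate", {
        "attributeName": "cx",        # which attribute to animate
        "from": "20", "to": "180",    # start and end values
        "dur": "3s",                  # duration of one cycle
        "repeatCount": "indefinite",  # loop forever
    })
    return ET.tostring(svg, encoding="unicode")
```

Saved with an .svg extension, the output plays in any viewer that supports the animation tags; in a static-only renderer (as Batik was at the time) the circle simply sits at its starting position.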
References:
Scalable Vector Graphics (SVG) 1.0 Specification: http://www.w3.org/TR/SVG/
SVG 1.0 animation specification: http://www.w3.org/TR/SVG/animate.html
A selection of simple SVG examples: http://www.xmlpitstop.com/Default.asp?DataType=SVGEXAMPLES
Adobe SVG Viewer: http://www.adobe.com/svg/viewer/install/main.html
Adobe demos: http://www.adobe.com/svg/demos/main.html
Batik info and demos: http://xml.apache.org/batik/
Java Runtime Environment: http://java.sun.com/j2se/1.4.1/download.html

Synchronized Multimedia Integration Language (SMIL)

SMIL (pronounced "smile") is an XML language for describing the dynamic (temporal) behavior of events in a multimedia presentation. It uses time containers whose contained objects play in sequence, in parallel, or exclusively. It does not define any specific animations. A SMIL presentation currently uses the same pieces to construct animations and narration as MS Agent does: a series of graphic images and the sound files. It is an easy language to use, but it requires individual lip-syncing of phoneme images with the audio. Synchronizing with sounds (as in singing) is possible: each tone is played for the same duration as the image containing the corresponding mouth shape.

References:
SMIL Animation, W3C Recommendation, 4 September 2001: http://www.w3.org/TR/smil-animation/

MPEG-4

MPEG-4 (formally ISO/IEC international standard 14496) defines a multimedia system for interoperable communication of complex scenes containing audio, video, synthetic audio and graphics material. These standards are used to make interactive video on CD-ROM, DVD and digital television.
MPEG-4 builds on the proven success of three fields: digital television; interactive graphics applications (synthetic content); and interactive multimedia (the World Wide Web, distribution of and access to content). MPEG-4 provides the standardized technological elements enabling the integration of the production, distribution and content-access paradigms of these three fields.

References:
Stefano Battista, Franco Casalino, and Claudio Lande, "MPEG-4: A Multimedia Standard for the Third Millennium," IEEE Multimedia, October 1999. http://www.computer.org/multimedia/mu1999/pdf/u4074.pdf
J. Ostermann and E. Haratsch, "An animation definition interface: Rapid design of MPEG-4 compliant animated faces and bodies," International Workshop on Synthetic-Natural Hybrid Coding and Three Dimensional Imaging, pp. 216-219, Rhodes, Greece, September 5-9, 1997. http://www.research.att.com/~osterman/AnimatedHead/

Virtual Reality Modeling Language (VRML)

Virtual Reality Modeling Language is the international standard (ISO/IEC 14772) file format for describing interactive 3D multimedia on the Internet. It is a scripting language and is not XML based; a viewer is needed to see the .wrl formatted files. Groups such as avatardom.com have tutorials on constructing avatars using VRML.

References:
VRML97, International Standard ISO/IEC 14772: http://www.web3d.org/Specifications/
Tools, resources and demos for VRML, Java3D and X3D: http://www.web3d.org/vrml/vrml.htm
Examples: http://www.web3d.org/Specifications/VRML97/part1/examples.html
Interactive demo: http://www.web3d.org/Specifications/VRML97/part1/exampleD.5.wrl

Face Modeling Language (FML)

This language is used in the "ShowFace" application. FML is a hierarchical structure for face components, dynamic behavior and event handling. The application can automatically generate images that did not previously exist and would otherwise require manual construction.
FML uses the MPEG-4 Facial Animation Parameters (FAPs) and incorporates the timing and sequencing model of SMIL to create the different aspects of the language.

References:
Arya, Ali and Babak Hamidzadeh, "An XML-Based Language for Face Modeling and Animation," Dept. of Electrical and Computer Engineering, University of British Columbia, pp. 1-6. http://www.ece.ubc.ca/~alia/Multimedia/viip.pdf

Virtual Human Markup Language (VHML)

Virtual Human Markup Language is an XML-based language to control the web presence of virtual humans. It is designed to describe facial animation, body animation, dialogue-manager interaction, text-to-speech production and emotional representation, plus hyper- and multimedia information, with each of these handled by a subsystem markup language. The subsystem languages use XML Namespaces to inherit from existing standards. VHML allows Talking Heads to be controlled by XML markup, and it is being used in Talking Head (TH) programs such as MetaFace and the "Mentor System." The language is XML/XSL based and consists of the following sub-languages:
Artificial Intelligence Markup Language (AIML)
Body Animation Markup Language (BAML)
Dialogue Manager Markup Language (DMML)
Emotion Markup Language (EML)
Facial Animation Markup Language (FAML)
HyperText Markup Language (HTML)
Speech Markup Language (SML or Sable)

This framework is used in projects like the interFace Project. A number of frameworks are used to visualize the XML, such as the "MetaFace" framework, a combination of many technologies designed to bring anthropomorphic (human-like) interaction to websites.
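Since each sub-language is handled by its own subsystem, a VHML framework must route the elements of a marked-up utterance to the right renderer. The following is a hypothetical Python sketch of that dispatch step; the element names in the table are illustrative assumptions, not the actual VHML element set (see the VHML specification for the real tags).

```python
import xml.etree.ElementTree as ET

# Hypothetical routing table: which sub-language's subsystem handles which
# element. These tag names are illustrative assumptions, not the actual
# VHML element set -- consult the VHML specification for the real tags.
SUBSYSTEM_FOR_TAG = {
    "smile":    "FAML",  # facial animation
    "shrug":    "BAML",  # body animation
    "emphasis": "SML",   # speech synthesis markup
}

def route_utterance(xml_text: str):
    """Walk a VHML-style fragment in document order and hand each known
    element to the subsystem responsible for it, returning the plan."""
    root = ET.fromstring(xml_text)
    plan = []
    for element in root.iter():
        subsystem = SUBSYSTEM_FOR_TAG.get(element.tag)
        if subsystem:
            plan.append((subsystem, element.tag, (element.text or "").strip()))
    return plan
```

For example, route_utterance('&lt;utterance&gt;&lt;smile&gt;Hello!&lt;/smile&gt;&lt;/utterance&gt;') would dispatch the smile element to the facial-animation subsystem.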
References:
VHML Specification: http://www.vhml.org/
Tools and links for VHML: http://www.metaface.computing.edu.au/tools/tools.html
interFace Project home page: http://www.ist-interface.org/intro.htm
Demo of interFace: http://www.medialab.tfe.umu.se/interface/index.htm
Marriott, Andrew, "VHML - Virtual Human Markup Language," School of Computing, Curtin University of Technology, 2001. http://www.talkingheads.computing.edu.au/documents/workshops/TalkingHeadTechnologyWorkshop/workshop/marriott/vhml_workshop.pdf

Speech Recognition

A few languages, such as the Speech Recognition Grammar specification, TalkML and JSpeech, allow a grammar to be developed that determines the content of speech. They are also used to construct the verbal responses used in dialogs, based on the received input.

References:
Speech Recognition Grammar Specification for the W3C Speech Interface Framework, W3C Working Draft, 3 January 2001: http://www.w3.org/TR/2001/WD-speech-grammar-20010103/
TalkML: http://www.w3.org/Voice/TalkML/
JSpeech Grammar Format specification: http://www.w3.org/TR/jsgf/

Voice Integration

The integration of text-to-speech (TTS) synthesis with animation, referred to as visual TTS (VTTS), allows for the generation of visual human-computer interfaces using agents or avatars. The basic mechanism works as follows:
1. The TTS informs the "talking head" when phonemes are spoken.
2. The appropriate mouth shapes are animated and rendered.
3. The TTS produces the sound.
Instructions to the talking head are sequenced with requests for animations and dialog.

Reference:
Ostermann, Jörn, Mark Beutnagel, Ariel Fischer, and Yao Wang, "Integration of Talking Heads and Text-to-Speech Synthesizers for Visual TTS." http://www.research.att.com/~osterman/AnimatedHead/Icslp/icslp.html

Behavioral Expression Animation Toolkit (BEAT)

BEAT is a set of tools, including a knowledge base, that facilitates the automatic generation of selection choices for expressions and gestures.
The knowledge base also controls the synchronization of facial expressions.

Reference:
Cassell, J., Hannes Vilhjálmsson, and Timothy Bickmore, "BEAT: the Behavioral Expression Animation Toolkit," Proc. ACM SIGGRAPH, 2001. http://gn.www.media.mit.edu/groups/gn/pubs/siggraph2001.final.PDF

Agent Control

It is possible to generate the "agent controls" for the MS Agent in XML and translate them into code similar to the VBScript in the HTML pages, using the existing voices/wave files and issuing requests for animations and speech. The benefit would be a single location for the agent specifications. There are a number of options for constructing an agent; Microsoft's agent is an ActiveX object that uses a series of images as animations, voice recognition, and text-to-speech synthesis with lip-syncing.

Agent Definition Format

General Magic, Inc. created the Agent Definition Format (ADF), a scripting language that uses XML. The language addresses the characteristics of personalization, continuous running, semi-autonomy and communication. Programming tags are used for variables ("cells") and procedures ("handlers"). ADF encompasses interface agents as well as autonomous agents.

Reference:
Lange, Danny B., Tom Hill and Mitsuru Oshima, "A New Internet Agent Scripting Language Using XML," General Magic, Inc. http://www.moe-lange.com/danny/aiec.pdf
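The cell/handler split can be illustrated with a small sketch. The element and attribute names below are hypothetical, invented for this example rather than taken from General Magic's actual ADF syntax; they only mirror the stated idea of XML tags for variables (cells) and procedures (handlers).

```python
import xml.etree.ElementTree as ET

# Hypothetical ADF-style script. The tag and attribute names here are
# invented for illustration; the actual ADF syntax differs.
ADF_SCRIPT = """
<agent name="helper">
  <cell name="greeting">Hello, I am your assistant.</cell>
  <handler event="on_load">speak greeting</handler>
</agent>
"""

def load_agent(xml_text: str):
    """Split an ADF-like script into its variables (cells) and its
    procedures (handlers), keyed by cell name and triggering event."""
    root = ET.fromstring(xml_text)
    cells = {c.get("name"): (c.text or "").strip()
             for c in root.findall("cell")}
    handlers = {h.get("event"): (h.text or "").strip()
                for h in root.findall("handler")}
    return root.get("name"), cells, handlers
```

An engine built on such a format would evaluate the handlers as events fire, reading and writing the cells as its persistent state, which is what gives the agent its continuously running, semi-autonomous character.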