XML and Interface Agents

R. Gabrielle Reed
DIS - EDF 5906
December 11, 2002
Table of Contents
Introduction to XML
Is There an XML Agents Markup Language?
MS Agent
XML Graphic Tools
Flash and SWF
Scalable Vector Graphics (SVG)
Synchronized Multimedia Integration Language (SMIL)
MPEG-4
Virtual Reality Modeling Language (VRML)
Face Modeling Language (FML)
Virtual Human Markup Language (VHML)
Speech Recognition
Voice Integration
Behavioral Expression Animation Toolkit (BEAT)
Agent Control
Agent Definition Format
Introduction to XML
XML is a markup language that records the characteristics, attributes, and functional
aspects of an object so that they can be easily read, modified, and processed. It
provides a lower-bandwidth interface compared with other graphical interfaces.
Typically, a parser reads the XML and a transformer converts the information into a
format in which it may be viewed.
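As a hedged illustration of the idea (the element and attribute names here are invented for illustration, not taken from any particular standard), an agent's state and actions might be recorded as:

    <character name="helper" mood="happy">
      <!-- an action the agent should perform, with a duration -->
      <gesture type="wave" duration="2s"/>
      <!-- text the agent should speak -->
      <speech>Welcome back!</speech>
    </character>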
Some formats are viewed directly in a browser, e.g. Hypertext Markup Language
(HTML), Graphic Interchange Format (GIF), or Joint Photographic Experts Group (JPEG)
images. Others require a viewer (plug-in), as is the case for many graphics formats such
as Flash in Small Web Format (SWF), Synchronized Multimedia Integration Language
(SMIL), Scalable Vector Graphics (SVG), and Virtual Reality Modeling Language (VRML).
Systems and languages are also based on the Moving Picture Experts Group 4
(MPEG-4) standard, which covers graphics, audio, video, and the delivery and control of
the animation. Some XML languages require a framework to assemble the parts and
function within a server, such as the Virtual Human Markup Language (VHML), which
integrates dynamic animation with speech functions. Both MPEG-4 and VHML have
extensive parts and play different roles in generating the interface presence; some
systems integrate both.
Is There an XML Agents Markup Language?
XML agents are typically of two types. One type is a program that runs behind the
scenes and communicates with services and other agents to meet a particular goal. The
other kind is an agent with a personality, in some cases an Advanced Video Attribute
Terminal Assembler and Recreator (AVATAR); it is visible, communicates, and functions
as an interface to an application. This paper addresses the interface agent.
The anatomy (attributes and functionality) of an interface agent includes:
1. Agent controls
2. Visual appearance
3. Behavior, such as animation including gestures and affect
4. Voice synthesis (text-to-speech)
5. Balloon text (text representation)
6. Speech recognition
Table 1: XML languages used for particular functions, compared with the equivalent
Microsoft® (MS) Agent components.

Function: Agent Control
MS Agent component: Agent Engine
XML equivalent: XML parsers, transformers, frameworks, architectures with databases (DB) and knowledge bases (KB); Agent Definition Format (ADF)

Function: Dialog Construction
MS Agent component: Scripted statements in a web scripting language
XML equivalent: Dialog Manager Markup Language (DMML), Artificial Intelligence Markup Language (AIML)

Function: Image Construction and Animation
MS Agent component: Constructed graphic images
XML equivalent: Synchronized Multimedia Integration Language (SMIL), Scalable Vector Graphics (SVG), Virtual Reality Modeling Language (VRML)

Function: Balloon Text
MS Agent component: Agent Character Editor
XML equivalent: Speech Markup Language (SML)

Function: Face Animation
MS Agent component: Constructed graphic images by storyboard
XML equivalent: Face Markup Language (FML), Face Animation Markup Language (FAML)

Function: Voice
MS Agent component: Lernout & Hauspie® (L&H) TruVoice Text-to-Speech Engine
XML equivalent: Voice Markup Language (VoiceML), Voice control Markup Language (VoxML), Speech Markup Language (Sable/SML)

Function: Human Animation, Gestures, Body
MS Agent component: Constructed graphic images by storyboard
XML equivalent: Human Markup Language (HML), Gesture Markup Language (GML), Body Animation Markup Language (BAML)

Function: Human Animation and Voice Integration (Agent Construction)
MS Agent component: Agent Character Editor
XML equivalent: Moving Picture Experts Group 4 (MPEG-4), Virtual Human Markup Language (VHML)

Function: Speech Recognition
MS Agent component: Microsoft® Speech Recognition Engine
XML equivalent: Speech Recognition Grammar Markup (SRGM), Talk Markup Language (TalkML), JSpeech Grammar Format (JSGF)
The use of XML in generating the interface agent can be applied in a number of ways:
1. Agent Control, to invoke the agent engine or text-to-speech engines and to issue
animation or speech commands
2. Dialog Construction, to pace and control the timing of the conversation, to play
facial expressions, and to facilitate the scripting and timing of dialogs (see the
AIML example after this list)
3. Image Construction and Animation, to include dynamic or static images for
animation generation
4. Face Animation, to define the wire frame, surface, texture map, and motion of
critical points on the face
5. Voice, to generate the speech and display text such as the "balloon text"
6. Human Animation, to define the body, face motions, gestures, "stories," and
emotion
7. Human Animation and Voice Integration (Agent Construction), to include the
timing of text delivery, to specify the form of responses and replies, and to
integrate the animations and voice
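For example, dialog construction in Artificial Intelligence Markup Language (AIML), listed in Table 1, pairs an input pattern with a response template. A minimal sketch:

    <aiml>
      <category>
        <!-- when the user's input matches this pattern... -->
        <pattern>HELLO AGENT</pattern>
        <!-- ...the agent replies with this template -->
        <template>Hello! How can I help you today?</template>
      </category>
    </aiml>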
MS Agent
A good example of an agent is the MS Agent, an animation with lip-synch
capabilities. Speech synthesis and recognition are controlled through the L&H speech
engine. Behavior and animation are controlled by the agent engine using the MS
character file, which also defines the bubble text. Voice delivered as a wave file and
voice generated from text use two different processes.
Besides the core component applications, the following artifacts are needed to
assemble an MS Agent:
1. Images of the animation steps
2. Text scripts to be spoken (text-to-speech) or recorded scripts
3. Images of the seven mouth shapes needed for lip-synching
Using the MS Agent character editor, behaviors are made by inserting images into
functions defined as animations. For lip-synching, the mouth shapes are stored by
phoneme as "overlays." The agent is then generated, producing the ".ACS" character
file, which is stored by default in the "C:\Windows\MSAgent\Chars\" directory.
The agent, when used on a web page, is manipulated by an agent engine, which is
loaded and remains available for requests until it is unloaded. The animations are
played by sending requests to the engine, typically from VBScript or JavaScript, to play
an animation or to speak a phrase. The engine sends requests to the speech synthesizer
to speak a particular text, and it also processes the speak request to animate the
sequenced lip-synching according to the words in the text.
Reference:
Microsoft Agents: http://www.microsoft.com/msagent/
XML Graphic Tools
Graphic tools are used for animation. These are languages such as Synchronized
Multimedia Integration Language (SMIL), and Scalable Vector Graphics (SVG).
Animation categories of constructions currently available are puppets, talking heads and
virtual actors (avatars). Some also can be constructed using Virtual Reality Modeling
Language (VRML), Java3D and X3D.
Reference:
Graphic links: http://graphics.stanford.edu/~bregler/anim_links/
Flash and SWF
Some animation applications use Small Web Format (SWF), a movie file format from
Macromedia. SWF is used in Macromedia Flash to deliver graphics, animation, and
sound over the Internet, and it has become the medium of choice for many publishers
of multimedia. Flash has the ability to:
 Create and mask vector art
 Implement transition effects within a movie clip
 Incorporate dynamic text
 Create buttons and add navigation
 Implement stream and event sounds
According to Macromedia, the Flash player is small, which gives it wider
distribution; it is included in every Netscape download.
References:
Open SWF file information: http://www.openswf.org/
Macromedia Flash Developer Tutorial on “Pal2”:
C:\Program Files\Macromedia\Flash MX\Help\Flash\ContextHelp_tut1.htm
http://www.macromedia.com/support/general/ts/documents/sw_flash_differences.htm
Scalable Vector Graphics (SVG)
SVG is a language for describing two-dimensional graphics in XML. SVG
allows for three types of graphic objects: vector graphic shapes (e.g., paths consisting of
straight lines and curves), images, and text. Graphical objects can be grouped, styled,
transformed, and composited into previously rendered objects. The feature set includes
nested transformations, clipping paths, alpha masks, filter effects, and template objects.
SVG drawings can be interactive, with mouse-overs and on-clicks, and dynamic,
through timing and transformations. Animations can be defined and triggered either
declaratively (i.e., by embedding SVG animation elements in SVG content) or via
scripting. Animations are constructed using images, basic shapes, canvases, and fills.
Adobe Illustrator 10 can create the images and save them as SVG. Animations can be
time-sliced; for example, an animation may be represented by an object moving along a
path during each time period.
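As a small declarative example (the path data and timing are chosen arbitrarily), the standard animateMotion element moves a shape along a path and repeats:

    <svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
      <!-- a circle that travels along a curved path every 4 seconds -->
      <circle r="8" fill="blue">
        <animateMotion dur="4s" repeatCount="indefinite"
                       path="M 10,50 C 60,10 140,90 190,50"/>
      </circle>
    </svg>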
Batik is the open-source SVG tool from Apache.org. The static SVG tags are
supported, while the animation and dynamic tags are still being developed. As of
August 29, 2002, interactivity is limited to text events and keyboard events, and
descriptions for motion are undergoing testing. To see the Batik demos, one needs to
install the Java Runtime Environment.
References:
Scalable Vector Graphics (SVG) 1.0 Specification: http://www.w3.org/TR/SVG/
SVG 1.0 animation specifications: http://www.w3.org/TR/SVG/animate.html
A selection of simple SVG Examples:
http://www.xmlpitstop.com/Default.asp?DataType=SVGEXAMPLES
Adobe SVG Viewer:
http://www.adobe.com/svg/viewer/install/main.html
Adobe demos: http://www.adobe.com/svg/demos/main.html
Batik Info and Demos: http://xml.apache.org/batik/
Java Runtime environment: http://java.sun.com/j2se/1.4.1/download.html
Synchronized Multimedia Integration Language (SMIL)
SMIL is Synchronized Multimedia Integration Language (pronounced smile). It is an
XML language to describe dynamic (temporal) descriptions of events in a multimedia
presentation. It uses time containers, in sequence or in parallel or with exclusive actions
related to contained objects. It does not have any descriptions for specific animations.
Constructing animations and narration in SMIL currently uses the same pieces as
MS Agent (a series of graphic-format images and the sound files). It is an easy-to-use
language, but it requires individual lip-synching of phoneme images along with the
audio. Synchronizing with sounds (as in singing) is available: the tone is played for the
same duration as the image containing the corresponding mouth shape.
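A hedged sketch of such lip-synching in SMIL follows; the file names are hypothetical, and the durations would be matched to the phoneme timing of the audio:

    <smil>
      <body>
        <par>
          <!-- play the recorded phrase... -->
          <audio src="hello.wav"/>
          <!-- ...while showing one mouth-shape image per phoneme -->
          <seq>
            <img src="mouth_closed.gif" dur="0.15s"/>
            <img src="mouth_eh.gif" dur="0.20s"/>
            <img src="mouth_oh.gif" dur="0.25s"/>
          </seq>
        </par>
      </body>
    </smil>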
References:
SMIL Animation W3C Recommendation 04-September-2001:
http://www.w3.org/TR/smil-animation/
MPEG-4
MPEG-4 (formally ISO/IEC international standard 14496) defines a multimedia
system for interoperable communication of complex scenes containing audio, video,
synthetic audio, and graphics material. The standard is used to make interactive video
on CD-ROM, DVD, and digital television. MPEG-4 builds on the proven success of
three fields:
 Digital television;
 Interactive graphics applications (synthetic content);
 Interactive multimedia (World Wide Web, distribution of and access to content)
MPEG-4 provides the standardized technological elements enabling the integration of
the production, distribution and content access paradigms of the three fields.
References:
Stefano Battista, Franco Casalino, and Claudio Lande, "MPEG-4: A Multimedia Standard
for the Third Millennium," IEEE Multimedia, October 1999.
http://www.computer.org/multimedia/mu1999/pdf/u4074.pdf
J. Ostermann and E. Haratsch, "An animation definition interface: Rapid design of
MPEG-4 compliant animated faces and bodies," International Workshop on Synthetic-
Natural Hybrid Coding and Three Dimensional Imaging, pp. 216-219, Rhodes, Greece,
September 5-9, 1997. http://www.research.att.com/~osterman/AnimatedHead/
Virtual Reality Modeling Language (VRML)
Virtual Reality Modeling Language is the International Standard (ISO/IEC 14772) file
format for describing interactive 3D multimedia on the Internet. It is a scripting
language and is not XML based; a viewer is used to see the .wrl formatted files. Groups
such as avatardom.com offer tutorials on constructing avatars using VRML.
References:
VRML97, as International Standard ISO/IEC 14772,
http://www.web3d.org/Specifications/
Tools, resources and demos for VRML, Java3D and X3D
http://www.web3d.org/vrml/vrml.htm
Examples: http://www.web3d.org/Specifications/VRML97/part1/examples.html
Interactive demo: http://www.web3d.org/Specifications/VRML97/part1/exampleD.5.wrl
Face Modeling Language (FML)
This language is used in the "ShowFace" application. The language is a hierarchical
structure for face components, dynamic behavior, and event handling. The application
has the benefit of automatically generating images that did not exist before and would
otherwise require manual construction. The language uses the MPEG-4 Facial
Animation Parameters (FAPs) and incorporates the timing and sequencing of SMIL to
create different aspects of the language.
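Based on that description, a hedged sketch of what an FML story might look like follows; the element names are approximations for illustration only, not taken from the FML specification:

    <fml>
      <story>
        <!-- SMIL-style sequencing: speak, then smile -->
        <seq>
          <talk>Hello there.</talk>
          <expr type="smile" dur="1s"/>
        </seq>
      </story>
    </fml>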
References:
Arya, Ali, and Babak Hamidzadeh, "An XML-Based Language for Face Modeling and
Animation," Dept. of Electrical and Computer Engineering, University of British
Columbia, pp. 1-6. http://www.ece.ubc.ca/~alia/Multimedia/viip.pdf
Virtual Human Markup Language (VHML)
Virtual Human Markup Language is an XML-based language to control the web
presence of virtual humans. It is designed to describe Facial Animation, Body Animation,
Dialogue Manager interaction, Text-to-Speech production, and Emotional Representation,
plus Hyper- and Multi-Media information, with each of these handled by a subsystem
markup language. The subsystem languages use XML Namespaces to inherit from
existing standards.
This language allows Talking Heads to be controlled by markup in XML. It is
being used in Talking Head (TH) programs such as MetaFace and the "Mentor System."
The language is XML/XSL based and consists of the following sub-systems:
 Artificial Intelligence Markup Language (AIML)
 Body Animation Markup Language (BAML)
 Dialogue Manager Markup Language (DMML)
 Emotion Markup Language (EML)
 Facial Animation Markup Language (FAML)
 HyperText Markup Language (HTML)
 Speech Markup Language (SML or Sable)
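As a hedged illustration (the element names are drawn from the sub-languages above but are not verified against the specification), a VHML fragment might mark up a spoken sentence with emotion and a facial gesture:

    <vhml>
      <person disposition="friendly">
        <p>
          <!-- emotion markup around the spoken text -->
          <happy>Welcome back!</happy>
          <!-- a facial animation element -->
          <smile/>
          I have found three new articles for you.
        </p>
      </person>
    </vhml>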
This framework is used in projects like the interFace Project. A number of
frameworks are used to visualize the XML, such as the "MetaFace" framework, a
combination of many technologies designed to bring anthropomorphic (human-like)
interaction to websites.
References:
VHML Specification: http://www.vhml.org/
Tools and links for VHML: http://www.metaface.computing.edu.au/tools/tools.html
Interface Project home page: http://www.ist-interface.org/intro.htm
Demo of interFace: http://www.medialab.tfe.umu.se/interface/index.htm
Marriott, Andrew, "VHML - Virtual Human Markup Language," School of Computing,
Curtin University of Technology, 2001.
http://www.talkingheads.computing.edu.au/documents/workshops/TalkingHeadTechnologyWorkshop/workshop/marriott/vhml_workshop.pdf
Speech Recognition
A few languages, such as the Speech Recognition Grammar formats, TalkML, and
JSpeech, allow for the development of grammars that determine the content of speech.
They are also used to construct the verbal responses used in dialogs, based on the
received input.
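For example, the W3C speech grammar drafts define an XML form in which rules enumerate the utterances to be recognized. A minimal sketch (namespace and attribute details may differ between draft versions):

    <grammar xmlns="http://www.w3.org/2001/06/grammar" root="answer">
      <rule id="answer">
        <!-- the recognizer accepts exactly one of these words -->
        <one-of>
          <item>yes</item>
          <item>no</item>
          <item>maybe</item>
        </one-of>
      </rule>
    </grammar>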
References:
Speech Recognition Grammar Specification for the W3C Speech Interface Framework,
W3C Working Draft 3 January 2001:
http://www.w3.org/TR/2001/WD-speech-grammar-20010103/
TalkML: http://www.w3.org/Voice/TalkML/
JSpeech specification: http://www.w3.org/TR/jsgf/
Voice Integration
The integration of text-to-speech (TTS) synthesis with animation, referred to as
Visual TTS (VTTS), allows for the generation of visual human-computer interfaces using
agents or avatars. The basic mechanism occurs as follows:
1. The TTS informs the "talking head" when phonemes are spoken.
2. The appropriate mouth shapes are animated and rendered.
3. The TTS produces the sound.
Instructions to the talking head are sequenced with requests for animations and dialog.
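A purely hypothetical markup sketch of that sequencing follows; real VTTS systems use engine-specific interfaces, and these element names are invented:

    <!-- hypothetical: alternate animation requests with spoken dialog -->
    <vtts>
      <animate gesture="greet"/>
      <speak>Welcome to the help system.</speak>
      <animate gesture="point"/>
      <speak>Please choose a topic.</speak>
    </vtts>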
Reference:
Ostermann, Jörn, Mark Beutnagel, Ariel Fischer, Yao Wang, "Integration of Talking
Heads and Text-to-Speech Synthesizers for Visual TTS",
http://www.research.att.com/~osterman/AnimatedHead/Icslp/icslp.html
Behavioral Expression Animation Toolkit (BEAT)
BEAT is a set of tools, including a knowledge base, that facilitates the automatic
generation of selection choices for expressions and gestures. The knowledge base also
controls the synchronization of facial expressions.
Reference:
Cassell, J., Hannes Vilhjálmsson, and Timothy Bickmore, "BEAT: the Behavioral
Expression Animation Toolkit," Proc. ACM SIGGRAPH, 2001.
http://gn.www.media.mit.edu/groups/gn/pubs/siggraph2001.final.PDF
Agent Control
It is possible to generate the "agent controls" in XML for the MS Agent and
translate them into code similar to the VBScript in HTML pages, using the existing
voices/wave files and issuing requests for animations and speech. The benefit would be
a single location for the agent specifications.
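A hedged sketch of what such agent-control markup might look like (the element names are invented; a translation layer would map them onto the usual MS Agent script requests):

    <!-- hypothetical control markup, translated into VBScript/JavaScript requests -->
    <agent-control character="Merlin">
      <load/>
      <play animation="Greet"/>
      <speak>Hello! How can I help?</speak>
      <play animation="RestPose"/>
    </agent-control>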
There are a number of options for constructing an agent. Microsoft offers an agent
that is an ActiveX object, using a series of images as animations, together with voice
recognition and text-to-speech synthesis with lip-synching.
Agent Definition Format
General Magic, Inc. created the Agent Definition Format (ADF), a scripting language
that uses XML. The language addresses agents that are personalized, continuously
running, and semi-autonomous, and that communicate. Programming tags are used for
variables (cells) and procedures (handlers). The format encompasses interface agents as
well as autonomous agents.
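Based on that description, an ADF script might pair cells with handlers; the tag names below are illustrative only, not taken from the ADF specification:

    <!-- illustrative only: cells hold state, handlers respond to events -->
    <agent name="shopper">
      <cell name="budget">100</cell>
      <handler event="price-quote">
        <!-- compare the quoted price against the budget cell -->
      </handler>
    </agent>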
Reference:
Lange, Danny B., Tom Hill, and Mitsuru Oshima, "A New Internet Agent Scripting
Language Using XML," General Magic, Inc. http://www.moe-lange.com/danny/aiec.pdf