With thanks to Jim Larson

From Voice Browsers to Multimodal Systems
The W3C Speech Interface Framework
http://www.w3.org/Voice
Dave Raggett, W3C Lead for Voice/Multimodal
W3C & Openwave
dsr@w3.org
W3C AC/WWW10 Hong Kong May 2001

Voice – The Natural Interface
available from over a billion phones
• Personal assistant functions:
  – Name dialing and Search
  – Personal Information Management
  – Unified Messaging (mail, Fax & IM)
  – Call screening & call routing
• Voice Portals
  – Access to news, information, entertainment, customer service and V-commerce (e.g. Find a friend, Wine Tips, Flight info, Find a hotel room, Buy ringing tones, Track a shipment)
• Front-ends for Call Centers
  – 90% cost savings over human agents
  – Reduced call abandonment rates (IVR)
  – Increased customer satisfaction
(Portal Demo)

W3C Voice Browser Working Group
http://www.w3.org/Voice/Group
• Founded: May 1999 following workshop in October 1998
• Mission
  – Prepare and review markup languages to enable Internet-based speech applications
• Has published requirements and specifications for languages in the W3C Speech Interface Framework
• Is now due to be re-chartered with clarified IP policy

Voice Browser WG Membership
Alcatel, AnyDevice, Ask Jeeves, AT&T, Avaya, BeVocal, Brience, BT, Canon, Cisco, Comverse, Conversay, EDF, France Telecom, General Magic,
Hitachi, HP, IBM, Informio, Intel, IsSound, Lernout & Hauspie, Locus Dialogue, Lucent, Microsoft, Milo, Mitre, Motorola, Nokia, Nortel Networks,
Nuance, Philips, Openwave, PipeBeach, SpeechHost, SpeechWorks, Sun Microsystems, Telecom Italia, Telera, Tellme, Unisys, Verascape, VoiceGenie, Voxeo, VoxSurf, Yahoo

W3C Speech Interface Framework
Diagram components: N-gram Grammar ML, Natural Language Semantics ML, Speech Recognition Grammar ML, ASR, Language Understanding, VoiceXML 2.0, Context Interpretation, World Wide Web, DTMF Tone Recognizer, Dialog Manager, Lexicon, User, Prerecorded Audio Player, TTS, Media
Planning, Language Generation, Speech Synthesis ML, Reusable Components, Telephone System, Call Control

W3C Speech Interface Framework: Published Documents
Documents available at http://www.w3.org/Voice
[Chart: publication status of each specification. Columns: Dialog, Speech Synthesis, Speech Grammar, N-gram, NL Semantics, Reusable Comp'ts, Lexicon, Call Control. Rows (maturity levels): REC, PR, CR, LCWD, WD, REQ. Cell values in extraction order: Soon, 1-01, 1-01, Soon, 12-99, 12-99, 12-99, 5-00, 12-99, 12-99, 12-99, 12-99, 12-99, 5-00, 2-01, 4-01]

Voice User Interfaces and VoiceXML
• Why use voice as a user interface?
  – Far more phones than PCs
  – More wireless phones than PCs
  – Hands and eyes free operation
• Why do we need a language for specifying voice dialogs?
  – High-level language simplifies application development
  – Separates Voice interface from Application server
  – Leverage existing Web application development tools
• What does VoiceXML describe?
  – Conversational dialogs: System and user turns to speak
  – Dialogs based on form-filling metaphor plus events and links
• W3C is standardizing VoiceXML based upon VoiceXML 1.0 submission by AT&T, IBM, Lucent and Motorola

VoiceXML Architecture
Brings the power of the Web to Voice
[Diagram labels: Any Phone; PSTN or VoIP; Speech + DTMF; VoiceXML Gateway; VoiceXML, Grammars, Audio files; Consumer or Corporate Web site; Carrier; Corporation]

Reaching Out to Multiple Channels
[Diagram labels: Applications; Database; XML, Images, Audio, …; Content Adaptation (Adjust as needed for each device & user); XHTML; VoiceXML; WML/HDML]

VoiceXML Features
• Menus, Forms, Sub-dialogs
  – <menu>, <form>, <subdialog>
• Inputs
  – Speech Recognition <grammar>
  – Recording <record>
  – Keypad <dtmf>
• Output
  – Audio files <audio>
  – Text-To-Speech
• Variables
  – <var>, <script>
• Events
  – <nomatch>, <noinput>, <help>, <catch>, <throw>
• Transition & submission
  – <goto>, <submit>
• Telephony
  – Call transfer
  – Telephony information
• Platform
  – Objects
• Performance
  – Fetch

Example VoiceXML
<menu>
  <prompt>
    <speak>
      Welcome to Ajax Travel. Do you want to fly to
      <emphasis> New York </emphasis> or <emphasis> Washington </emphasis>
    </speak>
  </prompt>
  <choice next="http://www.NY...">
    <grammar>
      <choice>
        <item> New York </item>
        <item> Big Apple </item>
      </choice>
    </grammar>
  </choice>
  <choice next="http://www.Wash...">
    <grammar>
      <choice>
        <item> Washington </item>
        <item> The Capital </item>
      </choice>
    </grammar>
  </choice>
</menu>

Example VoiceXML
<form id="weather_info">
  <block>Welcome to the international weather service.</block>
  <field name="country">
    <prompt>What country?</prompt>
    <grammar src="country.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the country for which you want the weather.
    </catch>
  </field>
  <field name="city">
    <prompt>What city?</prompt>
    <grammar src="city.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the city for which you want the weather.
    </catch>
  </field>
  <block>
    <submit next="/servlet/weather" namelist="city country"/>
  </block>
</form>

VoiceXML Implementations
See http://www.w3.org/Voice
• BeVocal
• General Magic
• HeyAnita
• IBM
• Lucent
• Motorola
• Nuance
• PipeBeach
• SpeechWorks
• Telera
• Tellme
• VoiceGenie
These are the companies who asked to be listed on the W3C Voice page

Reusable Components
[Diagram labels: Voice Application Developer; Reusable Components; Dialog Manager; VoiceXML Scripts]

Reusable Dialog Modules
• Express application at task level rather than interaction level
• Save development time by reusing tried and effective modules
• Increase consistency among applications
Examples include:
  Credit card number, Date, Name, Address, Telephone number, Yes/No question,
  Shopping cart, Order status, Weather, Stock quotes, Sport scores, Word games

Speech Grammar ML
• Specifies the words and patterns of words for which a speaker-independent recognizer can listen
• May be specified
  – Inline as part of a VoiceXML page
  – Referenced and stored separately on Web servers
• Three variants: XML, ABNF, N-Gram
• Action Tags for "semantic processing"

Three forms of the Grammar ML
XML form:
  <rule id="state" scope="public">
    <one-of>
      <item> Oregon </item>
      <item> Maine </item>
    </one-of>
  </rule>
ABNF form:
  public $state = Oregon | Maine
• XML
  – Modeled after Java Speech Grammar Format
  – Mandatory for Dialog ML interpreters
  – Manually specified by developer
• Augmented BNF syntax (ABNF)
  – Modeled after Java Speech Grammar Format
  – Optional for Dialog ML interpreters
  – May be mapped to and from XML grammars
  – Manually specified by developer
• N-grams
  – Optional for Dialog ML interpreters
  – Used for larger vocabularies
  – Generated statistically

Action Tags
• Specify what VoiceXML variables to set when
  grammar rules are matched to user input
• Based upon subset of ECMAScript

  $drink = coke | pepsi | coca cola {"coke"};
  // medium is default if nothing said
  $size = {"medium"} [small | medium | large | regular {"medium"}]

N-Gram Language Models
• Likelihood of a given word following certain others
• Used as a linguistic model to identify most likely sequence of words that matches the spoken input
• N-Grams are computed automatically from a corpus of many inputs
• The N-Gram Markup Language is used as an interchange format for passing the results of automatic analysis of words and phrases to a dictation ASR engine

Speech synthesis process
modeled after Sun's Java Speech Markup Language
Stages: Structure Analysis -> Text Normalization -> Text-to-Phoneme Conversion -> Prosody Analysis -> Waveform Production
IN:
• Dr. Jones lives at 175 Park Dr.
• He weighs 175 lb. He plays bass in a blues band. He also likes to fish; last week he caught a 20 lb. bass.
OUT:
  Doctor Jones lives at one seventy-five Park Drive. He weighs one hundred and seventy-five pounds. He plays base in a blues band. He likes to fish; last week he caught a twenty-pound bass.

Speech Synthesis ML: Structure Analysis
Non-markup behavior: infer structure by automated text analysis
Markup support: paragraph, sentence
Example:
  <paragraph>
    <sentence> This is the first sentence. </sentence>
    <sentence> This is the second sentence. </sentence>
  </paragraph>

Speech Synthesis ML: Text Normalization
Non-markup behavior: automatically identify and convert constructs
Markup support: sayas for dates, times, etc.
Examples:
  <sayas sub="World Wide Web Consortium"> W3C </sayas>
  <sayas type="number:digits"> 175 </sayas>
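
The "non-markup" text normalization behavior described above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function names (`two_digits`, `cardinal`, `street_number`, `normalize`) and the context rules are hypothetical, invented here to reproduce the "Dr. Jones" example from the synthesis-process slide, and are not taken from the Speech Synthesis ML drafts or from any real TTS engine. A production normalizer would need far richer context handling; where the heuristics guess wrong, explicit sayas markup is the intended remedy.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def cardinal(n: int) -> str:
    """Spell out 0..999 as a quantity (175 -> 'one hundred and seventy-five')."""
    if n < 100:
        return two_digits(n)
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" and " + two_digits(rest) if rest else "")


def street_number(n: int) -> str:
    """Spell out 0..999 as a house number (175 -> 'one seventy-five')."""
    if n < 100 or n % 100 == 0:
        return cardinal(n)
    hundreds, rest = divmod(n, 100)
    if rest < 10:
        return ONES[hundreds] + " oh " + ONES[rest]
    return ONES[hundreds] + " " + two_digits(rest)


def _sub_number(pattern: str, render, text: str) -> str:
    """Replace each match, passing the captured integer to render()."""
    return re.sub(pattern, lambda m: render(int(m.group(1))), text)


def normalize(text: str) -> str:
    """Hypothetical rule-based normalizer; handles numbers up to 999 only."""
    # "20 lb. bass": a weight used as a modifier before a lowercase noun.
    text = _sub_number(r"\b(\d{1,3}) lb\.(?=\s+[a-z])",
                       lambda n: cardinal(n) + "-pound", text)
    # "175 lb.": assumed to end the sentence, so the period is kept.
    text = _sub_number(r"\b(\d{1,3}) lb\.",
                       lambda n: cardinal(n) + " pounds.", text)
    # "175 Park": a number before a capitalized word is read as a house number.
    text = _sub_number(r"\b(\d{1,3})\b(?=\s+[A-Z])", street_number, text)
    # Any remaining short number is read as a quantity.
    text = _sub_number(r"\b(\d{1,3})\b", cardinal, text)
    # "Park Dr.": after a capitalized word, "Dr." is a street suffix
    # (assumed to end the sentence, so the period is kept).
    text = re.sub(r"\b([A-Z][a-z]+) Dr\.", r"\1 Drive.", text)
    # Any other "Dr." is read as a title.
    return re.sub(r"\bDr\.", "Doctor", text)


if __name__ == "__main__":
    print(normalize("Dr. Jones lives at 175 Park Dr. He weighs 175 lb. "
                    "He also likes to fish; last week he caught a 20 lb. bass."))
```

The order of the rules matters: the unit rules run before the generic number rules, and the street-suffix rule runs before the title rule, so that each ambiguous token ("175", "Dr.") is resolved by its most specific context first. Choosing between the two pronunciations of "bass" is not attempted here, since it belongs to the later text-to-phoneme stage.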
Speech Synthesis ML: Text-to-Phoneme Conversion
Non-markup behavior: look up in a pronunciation dictionary
Markup support: phoneme, sayas
Phonetic Alphabets
• International Phonetic Alphabet
• Worldbet
• X-SAMPA
International Phonetic Alphabet (IPA) using character entities
Example:
  <phoneme alphabet="ipa" ph="t&#x252;m&#x251;to&#x28A;"> tomato </phoneme>

Speech Synthesis ML: Prosody Analysis
Non-markup behavior: automatically generates prosody through analysis of document structure and sentence syntax
Markup support: emphasis, break, prosody
Examples:
  <emphasis> Hi </emphasis>
  <break time="3s"/>
  <prosody rate="slow"/>
Prosody element
  pitch: high, medium, low, default
  contour
  range: high, medium, low, default
  rate: fast, medium, slow, default
  volume: silent, soft, medium, loud, default

Speech Synthesis ML: Waveform Production
Markup support: voice, audio
Examples:
  <audio src="laughter.wav">[laughter]</audio>
  <voice age="child"> Mary had a little lamb </voice>
Attributes
  gender: male, female, neutral
  age: child, teenager, adult, elder, (integer)
  variant: different, (integer)
  name: default, (voice-name)

LexiconML - Why?
• Accurate pronunciations are essential in EVERY speech application
• Platform default lexicons do not give 100% coverage of user speech
[Diagram: the Voice Application Developer supplies a Pronunciation Lexicon, <lexicon> either /iy th r/ either /ay th r/ </lexicon>. TTS: either -> /ay th r/. ASR: either -> /ay th r/, /iy th r/]

LexiconML - Key Requirements
• Meets both synthesis and recognition requirements
• Pronunciations for any language (including tonal)
  – reuse standard alphabets, support for suprasegmentals
• Multiple pronunciations per word
• Alternate orthographies
  – Spelling variations: "colour" and "color"
  – Alternative writing systems: Japanese Kanji and Kana
  – Abbreviations and Acronyms, e.g. Dr., BT
• Homophones, e.g. "read" and "reed" (same sound)
• Homographs, e.g. "read" and "read" (same spelling)

Interaction Style
• Voice user interfaces needn't be dull
• Choose prompts to reflect an explicit choice of personality
• Introduce variety in prompts rather than always repeating the same thing
• Politeness, helpfulness and sense of humor
• Target different groups of users e.g.
  Gen Y
• Allow users to select personality (skin)
(Personality Demo)

Call Control
[Diagram labels: Voice Application Developer; VoiceXML; User; Call Control; Dialog Manager]
(Call control Demo)

Call Control Requirements
• Call management: Place outbound call, conditionally answer inbound call, outbound fax
• Call leg management: Create, redirect, interact while on hold
• Conference management: Create, join, exit
• Intersession communication: Asynchronous events
• Interpreter context: Invoke, terminate

Natural Language Semantics ML
[Diagram labels: Voice Application Developer; Grammar and semantic tags; ASR; Text; Language Understanding; NL Semantics; Context Interpretation]

Natural Language Semantics ML
• Represent semantic interpretations of an utterance
  – Speech
  – Natural language text
  – Other forms (e.g., handwriting, OCR, DTMF)
• Used primarily as an interchange format among voice browser components
• Usually generated automatically and not authored directly by developers
• Goal is to use XForms as a data model

NL Semantics ML structure
[Diagram: Result (grammar, x-model, xmlns) contains Interpretation (confidence, grammar, x-model, xmlns). Interpretation carries Incoming data: Input (mode, timestamp-start, timestamp-end, confidence) with Text, Nomatch or Noinput; an XForms definition (xf:model); and Meaning (xf:instance), holding application-specific elements defined by the XForms data model]

What toppings do you have?
<interpretation grammar="http://toppings" xmlns:xf="http://www.w3.org/xxx">
  <input mode="speech">what toppings do you have?</input>
  <xf:x-model>
    <xf:group xf:name="question">
      <xf:string xf:name="questioned_item"/>
      <xf:string xf:name="questioned_property"/>
    </xf:group>
  </xf:x-model>
  <xf:instance>
    <app:question>
      <app:questioned_item>toppings</app:questioned_item>
      <app:questioned_property>availability</app:questioned_property>
    </app:question>
  </xf:instance>
</interpretation>

Richer Natural Language
• Most current voice apps restrict users to keywords or short phrases
• The application does most of the talking
• Alternative is to use open grammars with word spotting and let user do the talking
• Rules for figuring out what the user said and why as basis for asking next question
(GM/AskJeeves Demo)

Multimodal = Voice + Displays
• Say which City you want weather for and see the information on your phone
  "What is the weather in San Francisco?"
• Say which bands/CDs you want to buy and confirm the choices visually
  "I want to place an order for "Hotshot" by Shaggy."

Multimodal Interaction
• Multimodal applications
  – Voice + Display + Key pad + Stylus etc.
  – User is free to switch between voice interaction and use of display/key pad/clicking/handwriting
• July 2000 – Published Multimodal Requirements Draft
• Demonstrations of Multimodal prototypes at Paris face-to-face meeting of Voice Browser WG
• Joint W3C/WAP Forum workshop on Multimodal – Hong Kong September 2000
• February 2001 – W3C publishes Multimodal Request for Proposals
• Plan to set up Multimodal Working Group later this year assuming we get appropriate submission(s)

Multimodal Interaction
• Primary market is mobile wireless
  – cell phones, personal digital assistants and cars
• Timescale is driven by deployment of 3G networks
• Input modes:
  – speech, keypads, pointing devices, and electronic ink
• Output modes:
  – speech, audio, and bitmapped or character cell displays
• Architecture should allow for both local and remote speech processing

Some Ideas …
W3C is seeking detailed proposals with broad industry support as basis for chartering multimodal working group
• Speech enabling XHTML (and WML) without requiring changes to markup language
  – New ECMAScript Speech Object?
• Loose coupling of VoiceXML with externally defined pages written in XHTML, SMIL, etc.
  – Turn-driven synchronization protocol based on SIP?
• Distributed Speech Processing
  – Reduce load on wireless network and speech servers
  – Increase recognition accuracy in presence of noise
  – ETSI work on Aurora
• Using pen-based gestures to constrain ASR (click and speak)

VoiceXML IP Issues
• Technical work on VoiceXML 2.0 is proceeding well
• Publication of VoiceXML 2.0 working draft held up over IP issues (although internal version is accessible to W3C Members)
• Related specifications for grammar, speech synthesis, natural language semantics, lexicon, and call control have been or shortly will be published.
• W3C and VoiceXML Forum Management are in the process of developing a formal Memorandum of Understanding
• W3C is convening a Patent Advisory Group to recommend IP Policy for re-chartering the Voice Browser Activity
  – Draw inspiration from IETF, ECTF, ETSI and other bodies, e.g. require all WG members to license essential IP under openly specified RAND terms, with operational criteria for effective terms expressed in terms of exit criteria for Candidate Recommendation phase. No requirement for advance disclosure of IP

Discussion?