With thanks to Jim Larson
From Voice Browsers to
Multimodal Systems
The W3C Speech Interface Framework
http://www.w3.org/Voice
Dave Raggett
W3C Lead for Voice/Multimodal
W3C & Openwave
dsr@w3.org
W3C AC/WWW10
Hong Kong May 2001
Voice – The Natural Interface
available from over a billion phones
• Personal assistant functions:
– Name dialing and Search
– Personal Information Management
– Unified Messaging (mail, Fax & IM)
– Call screening & call routing
• Voice Portals
– Access to news, information, entertainment,
customer service and V-commerce
(e.g. Find a friend, Wine Tips, Flight info, Find a
hotel room, Buy ringing tones, Track a shipment)
• Front-ends for Call Centers
– 90% cost savings over human agents
– Reduced call abandonment rates (IVR)
– Increased customer satisfaction
(Portal Demo)
W3C Voice Browser Working Group
http://www.w3.org/Voice/Group
• Founded: May 1999 following workshop in October 1998
• Mission
– Prepare and review markup languages to enable Internet-based
speech applications
• Has published requirements and specifications for
languages in the W3C Speech Interface Framework
• Is now due to be re-chartered with clarified IP policy
Voice Browser WG Membership
Alcatel
AnyDevice
Ask Jeeves
AT&T
Avaya
BeVocal
Brience
BT
Canon
Cisco
Comverse
Conversay
EDF
France Telecom
General Magic
Hitachi
HP
IBM
Informio
Intel
IsSound
Lernout & Hauspie
Locus Dialogue
Lucent
Microsoft
Milo
Mitre
Motorola
Nokia
Nortel Networks
Nuance
Philips
Openwave
PipeBeach
SpeechHost
SpeechWorks
Sun Microsystems
Telecom Italia
Telera
Tellme
Unisys
Verascape
VoiceGenie
Voxeo
VoxSurf
Yahoo
W3C Speech Interface Framework
[Diagram: block diagram of the W3C Speech Interface Framework.
Components: User, Telephone System, ASR, DTMF Tone Recognizer, Language Understanding, Context Interpretation, Dialog Manager, Media Planning, Language Generation, TTS, Prerecorded Audio Player, World Wide Web.
Markup languages: VoiceXML 2.0, Speech Recognition Grammar ML, N-gram Grammar ML, Natural Language Semantics ML, Speech Synthesis ML, Lexicon, Reusable Components, Call Control.]
W3C Speech Interface Framework
Published Documents
Documents available at http://www.w3.org/Voice
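[Table: publication status of each document.
Rows (maturity level): REC, PR, CR, LCWD, WD, REQ.
Columns (document): Dialog, Speech Synthesis, Speech Grammar, N-gram, NL Semantics, Reusable Comp'ts, Lexicon, Call Control.
Cell values, in reading order (column positions not recoverable): Soon, 1-01, 1-01, Soon, 12-99, 12-99, 12-99, 5-00, 12-99, 12-99, 12-99, 12-99, 12-99, 5-00, 2-01, 4-01.]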
Voice User Interfaces and VoiceXML
• Why use voice as a user interface?
– Far more phones than PCs
– More wireless phones than PCs
– Hands and eyes free operation
• Why do we need a language for specifying voice dialogs?
– High-level language simplifies application development
– Separates Voice interface from Application server
– Leverage existing Web application development tools
• What does VoiceXML describe?
– Conversational dialogs: System and user turns to speak
– Dialogs based on form-filling metaphor plus events and links
• W3C is standardizing VoiceXML based upon VoiceXML 1.0
submission by AT&T, IBM, Lucent and Motorola
VoiceXML Architecture
Brings the power of the Web to Voice
[Diagram: Any Phone connects over PSTN or VoIP (Speech + DTMF) to a VoiceXML Gateway, which fetches VoiceXML, Grammars and Audio files from a Consumer or Corporate Web site. Further labels: Carrier, Corporation.]
Reaching Out to Multiple Channels
[Diagram: an Applications Database supplies XML, Images, Audio, … to a Content Adaptation layer, which is adjusted as needed for each device & user and delivers XHTML, VoiceXML or WML/HDML.]
VoiceXML Features
• Menus, Forms, Sub-dialogs
– <menu>, <form>, <subdialog>
• Inputs
– Speech Recognition <grammar>
– Recording <record>
– Keypad <dtmf>
• Output
– Audio files <audio>
– Text-To-Speech
• Variables
– <var>, <script>
• Events
– <nomatch>, <noinput>, <help>, <catch>, <throw>
• Transition & submission
– <goto>, <submit>
• Telephony
– Call transfer
– Telephony information
• Platform
– Objects
• Performance
– Fetch
Example VoiceXML
<menu>
  <prompt>
    <speak>
      Welcome to Ajax Travel.
      Do you want to fly to
      <emphasis>New York</emphasis>
      or
      <emphasis>Washington</emphasis>
    </speak>
  </prompt>
  <choice next="http://www.NY...">
    <grammar>
      <choice>
        <item> New York </item>
        <item> Big Apple </item>
      </choice>
    </grammar>
  </choice>
  <choice next="http://www.Wash...">
    <grammar>
      <choice>
        <item> Washington </item>
        <item> The Capital </item>
      </choice>
    </grammar>
  </choice>
</menu>
Example VoiceXML
<form id="weather_info">
  <block>Welcome to the international weather service.</block>
  <field name="country">
    <prompt>What country?</prompt>
    <grammar src="country.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the country for which you want the weather.
    </catch>
  </field>
  <field name="city">
    <prompt>What city?</prompt>
    <grammar src="city.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the city for which you want the weather.
    </catch>
  </field>
  <block>
    <submit next="/servlet/weather" namelist="city country"/>
  </block>
</form>
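The form-filling metaphor behind the weather form can be sketched as a loop that visits each unfilled field in document order, plays its prompt, and stores the answer once it matches the field's grammar. The Python below is a simplified illustration only, not the interpreter algorithm from the VoiceXML specification: `Field`, `run_form`, the word sets standing in for grammars, and the scripted list of answers standing in for a recognizer are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass
class Field:
    name: str
    prompt: str
    vocabulary: Set[str]   # hypothetical stand-in for the field's grammar
    help_text: str


def run_form(fields: List[Field], answers: Iterable[str],
             transcript: List[str]) -> Dict[str, str]:
    """Fill every field in document order, re-prompting until input matches."""
    replies = iter(answers)
    filled: Dict[str, str] = {}
    for field in fields:
        while field.name not in filled:
            transcript.append(field.prompt)
            said = next(replies).strip().lower()
            if said == "help":
                transcript.append(field.help_text)   # like <catch event="help">
            elif said in field.vocabulary:
                filled[field.name] = said
            else:
                transcript.append("Sorry, I did not understand.")  # no match
    return filled   # the values a <submit namelist="..."> would send


fields = [
    Field("country", "What country?", {"france", "japan"},
          "Please say the country for which you want the weather."),
    Field("city", "What city?", {"paris", "tokyo"},
          "Please say the city for which you want the weather."),
]
log: List[str] = []
result = run_form(fields, ["help", "France", "Lyon", "Paris"], log)
print(result)
```

The loop re-prompts after both a help request and an unrecognized answer, which is the system-turn and user-turn pattern the form describes.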
VoiceXML Implementations
See http://www.w3.org/Voice
• BeVocal
• General Magic
• HeyAnita
• IBM
• Lucent
• Motorola
• Nuance
• PipeBeach
• SpeechWorks
• Telera
• Tellme
• VoiceGenie
These are the companies who asked to be listed on the W3C Voice page
Reusable Components
[Diagram labels: Voice Application Developer (shown twice), Reusable Components, Dialog Manager, VoiceXML Scripts.]
Reusable Dialog Modules
• Express application at task level rather than interaction
level
• Save development time by reusing tried and effective
modules
• Increase consistency among applications
Examples include:
Credit card number
Date
Name
Address
Telephone number
Yes/No question
Shopping cart
Order status
Weather
Stock quotes
Sport scores
Word games
Speech Grammar ML
• Specifies the words and patterns of words for
which a speaker independent recognizer can listen
• May be specified
– Inline as part of a VoiceXML page
– Referenced and stored separately on Web servers
• Three variants: XML, ABNF, N-Gram
• Action Tags for “semantic processing”
Three forms of the Grammar ML
<rule id="state" scope="public">
  <one-of>
    <item> Oregon </item>
    <item> Maine </item>
  </one-of>
</rule>
public $state = Oregon | Maine
• XML
– Modeled after Java Speech
Grammar Format
– Mandatory for Dialog ML
interpreters
– Manually specified by developer
• Augmented BNF syntax (ABNF)
– Modeled after Java Speech
Grammar Format
– Optional for Dialog ML interpreters
– May be mapped to and from XML
grammars
– Manually specified by developer
• N-grams
– Optional for Dialog ML interpreters
– Used for larger vocabularies
– Generated statistically
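Because the XML and ABNF forms may be mapped to and from each other, the correspondence for a simple rule can be shown in a few lines. The Python sketch below handles only the `<rule>`/`<one-of>`/`<item>` subset used in the example on this slide; `xml_rule_to_abnf` is a hypothetical helper, and its output follows the `public $state = ...` notation shown here rather than any final specification.

```python
import xml.etree.ElementTree as ET

XML_RULE = """
<rule id="state" scope="public">
  <one-of>
    <item> Oregon </item>
    <item>Maine </item>
  </one-of>
</rule>
"""


def xml_rule_to_abnf(xml_text: str) -> str:
    """Convert a single <rule> holding one <one-of> list into ABNF text."""
    rule = ET.fromstring(xml_text)
    one_of = rule.find("one-of")
    if one_of is None:
        raise ValueError("only rules with a <one-of> child are handled")
    # Each <item> becomes one alternative; surrounding whitespace is dropped.
    alternatives = [(item.text or "").strip() for item in one_of.findall("item")]
    scope = rule.get("scope")
    prefix = f"{scope} " if scope else ""
    return f"{prefix}${rule.get('id')} = {' | '.join(alternatives)}"


abnf = xml_rule_to_abnf(XML_RULE)
print(abnf)
```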
Action Tags
• Specify what VoiceXML variables to set
when grammar rules are matched to user
input
• Based upon subset of ECMAScript
$drink = coke | pepsi | coca cola {"coke"};
// medium is default if nothing said
$size = {"medium"} [small | medium | large | regular {"medium"}]
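The effect of the two tagged rules can be illustrated with a small matcher that returns the values the tags would assign. This is a Python sketch under stated assumptions, not the grammar processor described in the drafts: the tables `DRINKS` and `SIZES` and the function `interpret` are hypothetical, an alternative without a tag is assumed to yield its own text, and the size is assumed to be spoken before the drink.

```python
# Each spoken alternative maps to the value assigned when it matches.
DRINKS = {"coke": "coke", "pepsi": "pepsi", "coca cola": "coke"}
SIZES = {"small": "small", "medium": "medium", "large": "large",
         "regular": "medium"}
DEFAULT_SIZE = "medium"   # assigned before the optional size phrase is seen


def interpret(utterance: str) -> dict:
    """Return the variables set for an utterance such as 'large coca cola'."""
    text = utterance.lower().strip()
    size = DEFAULT_SIZE
    for phrase, value in SIZES.items():
        if text.startswith(phrase + " "):
            size = value                      # the tag overrides the default
            text = text[len(phrase):].strip()
            break
    if text not in DRINKS:
        raise ValueError(f"no grammar rule matches {utterance!r}")
    return {"size": size, "drink": DRINKS[text]}


print(interpret("regular coca cola"))
```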
N-Gram Language Models
• Likelihood of a given word following certain
others
• Used as a linguistic model to identify most likely
sequence of words that matches the spoken input
• N-Grams are computed automatically from a
corpus of many inputs
• The N-Gram Markup Language is used as an
interchange format for passing the results of automatic
analysis of words and phrases to a dictation ASR engine.
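As a concrete case of "computed automatically from a corpus", the sketch below estimates a bigram model (N = 2) by counting adjacent word pairs. It is a minimal Python illustration: the three-sentence corpus is made up, `bigram_model` is a hypothetical name, and no smoothing is applied, which a real recognizer would need.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def bigram_model(corpus: List[str]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next word | previous word) by counting adjacent pairs."""
    pair_counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        # <s> and </s> mark the start and end of each utterance.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1
    return {
        prev: {word: count / sum(nexts.values())
               for word, count in nexts.items()}
        for prev, nexts in pair_counts.items()
    }


corpus = [
    "what is the weather in paris",
    "what is the weather in tokyo",
    "what is the time in paris",
]
model = bigram_model(corpus)
print(model["the"])
```

In this toy corpus "weather" follows "the" in two of three utterances, so the model prefers it over "time" when scoring candidate word sequences.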
Speech synthesis process
modeled after Sun’s Java Speech Markup Language
Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production
IN:
Dr. Jones lives at 175 Park Dr. He weighs 175 lb. He plays
bass in a blues band. He also likes to fish; last week he caught
a 20 lb. bass.
OUT:
Doctor Jones lives at one seventy-five Park Drive. He weighs one
hundred and seventy-five pounds. He plays base in a blues band. He
likes to fish; last week he caught a twenty-pound bass.
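The two readings of "Dr." in the Dr. Jones example show why normalization depends on context. The Python sketch below covers that one abbreviation with a deliberately naive heuristic (a street suffix follows a capitalized word, a title does not); `normalize_dr` is a hypothetical name, and numbers and "lb." are left untouched.

```python
import re


def normalize_dr(text: str) -> str:
    """Expand the abbreviation 'Dr.' as 'Drive' or 'Doctor' from its context."""
    # After a capitalized word, treat it as a street suffix. Keep the period
    # when a new sentence (or the end of the text) follows.
    text = re.sub(r"\b([A-Z][a-z]+) Dr\.(?=\s+[A-Z]|\s*$)", r"\1 Drive.", text)
    text = re.sub(r"\b([A-Z][a-z]+) Dr\.", r"\1 Drive", text)
    # Any remaining 'Dr.' is assumed to be a title before a name.
    return re.sub(r"\bDr\.", "Doctor", text)


spoken = normalize_dr("Dr. Jones lives at 175 Park Dr. He weighs 175 lb.")
print(spoken)
```

The heuristic fails on inputs such as "Ask Dr. Smith", where a capitalized word precedes a title, which is one reason markup that states the intended reading is useful.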
Speech Synthesis ML
Pipeline stage: Structure Analysis
Non-markup behavior: infer structure by automated text analysis
Markup support: paragraph, sentence
<paragraph>
  <sentence>
    This is the first sentence.
  </sentence>
  <sentence>
    This is the second sentence.
  </sentence>
</paragraph>
Speech Synthesis ML
Pipeline stage: Text Normalization
Non-markup behavior: automatically identify and convert constructs
Markup support: sayas for dates, times, etc.
Examples
<sayas sub="World Wide Web Consortium">W3C</sayas>
<sayas type="number:digits"> 175 </sayas>
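What the `number:digits` hint asks of the synthesizer can be shown in a few lines of Python. This is only an illustration of reading a number digit by digit; `say_as_digits` is a hypothetical name and is not part of any synthesizer API.

```python
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def say_as_digits(text: str) -> str:
    """Spell out each ASCII digit in the text, skipping everything else."""
    digits = [ch for ch in text if ch in "0123456789"]
    return " ".join(DIGIT_NAMES[int(d)] for d in digits)


print(say_as_digits(" 175 "))
```

Without the hint, a synthesizer has to guess between readings such as "one seventy-five" and "one hundred and seventy-five", both of which appear in the Dr. Jones example.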
Speech Synthesis ML
Pipeline stage: Text-to-Phoneme Conversion
Non-markup behavior: look up in a pronunciation dictionary
Markup support: phoneme, sayas
Phonetic Alphabets
• International Phonetic Alphabet
• Worldbet
• X-SAMPA
Example: International Phonetic Alphabet (IPA) using character entities
<phoneme alphabet="ipa" ph="tɒmɑtoʊ"> tomato</phoneme>
Speech Synthesis ML
Pipeline stage: Prosody Analysis
Non-markup behavior: automatically generates prosody through analysis
of document structure and sentence syntax
Markup support: emphasis, break, prosody
Examples
<emphasis> Hi </emphasis>
<break time="3s"/>
<prosody rate="slow"/>
Prosody element
pitch: high, medium, low, default
contour
range: high, medium, low, default
rate: fast, medium, slow, default
volume: silent, soft, medium, loud, default
Speech Synthesis ML
Pipeline stage: Waveform Production
Markup support: voice, audio
Examples
<audio src="laughter.wav">[laughter]</audio>
<voice age="child"> Mary had a little lamb </voice>
Attributes
gender: male, female, neutral
age: child, teenager, adult, elder, (integer)
variant: different, (integer)
name: default, (voice-name)
LexiconML - Why?
• Accurate pronunciations are essential in EVERY speech application
• Platform default lexicons do not give 100% coverage of user speech
[Diagram: the Voice Application Developer supplies a Pronunciation Lexicon that is shared by TTS and ASR:
<lexicon>
either /iy th r/
either /ay th r/
</lexicon>
TTS: "either" is spoken as /ay th r/.
ASR: /ay th r/ and /iy th r/ are both recognized as "either".]
LexiconML - Key Requirements
• Meets both synthesis and recognition requirements
• Pronunciations for any language (including tonal)
– reuse standard alphabets, support for suprasegmentals
• Multiple pronunciations per word
• Alternate orthographies
– Spelling variations, e.g. “colour” and “color”
– Alternative writing systems, e.g. Japanese Kanji and Kana
– Abbreviations and acronyms, e.g. Dr., BT
• Homophones, e.g. “read” and “reed” (same sound)
• Homographs, e.g. “read” and “read” (same spelling)
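Several of these requirements amount to a many-to-many mapping between spellings and pronunciations. The Python sketch below shows one possible in-memory shape for such a lexicon; it is not the LexiconML format. The class `Lexicon`, the rule that the first-listed pronunciation is the one a synthesizer speaks, and the informal phoneme strings for "read", "reed", "colour" and "color" are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set


class Lexicon:
    """Maps spellings to pronunciations and back, for TTS and ASR alike."""

    def __init__(self) -> None:
        self._by_spelling: Dict[str, List[str]] = defaultdict(list)
        self._by_sound: Dict[str, Set[str]] = defaultdict(set)

    def add(self, spelling: str, pronunciation: str) -> None:
        if pronunciation not in self._by_spelling[spelling]:
            self._by_spelling[spelling].append(pronunciation)
        self._by_sound[pronunciation].add(spelling)

    def pronunciations(self, spelling: str) -> List[str]:
        """All pronunciations of a word; a recognizer would accept any."""
        return list(self._by_spelling.get(spelling, []))

    def preferred(self, spelling: str) -> str:
        """First-listed pronunciation, assumed to be what TTS speaks."""
        return self._by_spelling[spelling][0]

    def spellings(self, pronunciation: str) -> Set[str]:
        """Words sharing a pronunciation (homophones, spelling variants)."""
        return set(self._by_sound.get(pronunciation, set()))


lex = Lexicon()
for word, phones in [("either", "ay th r"), ("either", "iy th r"),
                     ("read", "r iy d"), ("read", "r eh d"),
                     ("reed", "r iy d"),
                     ("colour", "k ah l er"), ("color", "k ah l er")]:
    lex.add(word, phones)
print(lex.pronunciations("either"), lex.spellings("r iy d"))
```

Keeping both directions indexed lets the same data serve synthesis (spelling to sound) and recognition (sound to spelling).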
Interaction Style
• Voice user interfaces needn't be dull
• Choose prompts to reflect an explicit choice of
personality
• Introduce variety in prompts rather than always
repeating the same thing
• Politeness, helpfulness and sense of humor
• Target different groups of users e.g. Gen Y
• Allow users to select personality (skin)
(Personality Demo)
Call Control
[Diagram labels: Voice Application Developer, VoiceXML, Call Control, Dialog Manager, User.]
(Call control Demo)
Call Control Requirements
• Call management—Place outbound call,
conditionally answer inbound call, outbound fax
• Call leg management—Create, redirect, interact
while on hold
• Conference management—Create, join, exit
• Intersession communication—Asynchronous
events
• Interpreter context—Invoke, terminate
Natural Language Semantics ML
[Diagram labels: Voice Application Developer, Grammar and semantic tags, ASR, Text, Language Understanding, NL Semantics, Context Interpretation.]
Natural Language Semantics ML
• Represent semantic interpretations of an utterance
– Speech
– Natural language text
– Other forms (e.g. handwriting, OCR, DTMF)
• Used primarily as an interchange format among
voice browser components
• Usually generated automatically and not authored
directly by developers
• Goal is to use XForms as a data model
NLSemantics ML structure
[Diagram: element tree of an NL Semantics result.
Result (attributes: grammar, x-model, xmlns)
  Interpretation (attributes: confidence, grammar, x-model, xmlns)
    Meaning:
      xf:model (XForms definition)
      xf:instance (application-specific elements defined by XForms data model)
    Incoming data:
      Input (attributes: mode, timestamp-start, timestamp-end, confidence)
        Text, Nomatch, Noinput, or nested Input containing Text]
What toppings do you have?
<interpretation grammar="http://toppings" xmlns:xf="http://www.w3.org/xxx">
  <input mode="speech">what toppings do you have?</input>
  <xf:x-model>
    <xf:group xf:name="question">
      <xf:string xf:name="questioned_item"/>
      <xf:string xf:name="questioned_property"/>
    </xf:group>
  </xf:x-model>
  <xf:instance>
    <app:question>
      <app:questioned_item>toppings</app:questioned_item>
      <app:questioned_property>availability</app:questioned_property>
    </app:question>
  </xf:instance>
</interpretation>
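A component that receives such an interpretation mainly needs the input text and the instance values. The Python sketch below reads them with the standard library parser. It makes two assumptions so that the document parses: the `app` prefix is bound to a made-up URI (`http://example.com/app`), and the `xf` URI is the placeholder shown on the slide. `read_interpretation` is a hypothetical helper, and the model section is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# The xf URI is the slide's placeholder; the app URI is invented here.
DOC = """
<interpretation grammar="http://toppings"
                xmlns:xf="http://www.w3.org/xxx"
                xmlns:app="http://example.com/app">
  <input mode="speech">what toppings do you have?</input>
  <xf:instance>
    <app:question>
      <app:questioned_item>toppings</app:questioned_item>
      <app:questioned_property>availability</app:questioned_property>
    </app:question>
  </xf:instance>
</interpretation>
"""

NS = {"xf": "http://www.w3.org/xxx", "app": "http://example.com/app"}


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_interpretation(xml_text: str) -> dict:
    """Collect the grammar, input mode, input text and instance values."""
    root = ET.fromstring(xml_text)
    user_input = root.find("input")
    question = root.find("xf:instance/app:question", NS)
    return {
        "grammar": root.get("grammar"),
        "mode": user_input.get("mode"),
        "text": user_input.text,
        "slots": {local_name(child.tag): child.text for child in question},
    }


meaning = read_interpretation(DOC)
print(meaning["slots"])
```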
Richer Natural Language
• Most current voice apps restrict users to
keywords or short phrases
• The application does most of the talking
• Alternative is to use open grammars with
word spotting and let user do the talking
• Rules for figuring out what the user said
and why as basis for asking next question
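Word spotting over an open grammar can be sketched as scanning the whole utterance for known phrases and ignoring the filler around them, then choosing the next question from what was found. The Python below is a toy illustration: the keyword table, the wording of the questions, and the names `spot_city` and `next_question` are hypothetical, and a real system would work on recognizer output rather than clean text.

```python
import re
from typing import Dict, Optional

# Hypothetical keyword table: phrase to spot -> the value it implies.
CITY_KEYWORDS: Dict[str, str] = {
    "san francisco": "San Francisco",
    "new york": "New York",
    "big apple": "New York",
    "washington": "Washington",
}


def spot_city(utterance: str) -> Optional[str]:
    """Return the city implied by the earliest keyword, ignoring filler."""
    text = " ".join(re.findall(r"[a-z']+", utterance.lower()))
    hits = []
    for phrase, city in CITY_KEYWORDS.items():
        match = re.search(rf"\b{re.escape(phrase)}\b", text)
        if match:
            hits.append((match.start(), city))
    return min(hits)[1] if hits else None


def next_question(utterance: str) -> str:
    """Pick the follow-up prompt from what the user was understood to say."""
    city = spot_city(utterance)
    if city is None:
        return "Which city do you want to fly to?"
    return f"When do you want to fly to {city}?"


print(next_question("Um, I'd like to go to the Big Apple please"))
```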
(GM/AskJeeves Demo)
Multimodal = Voice + Displays
• Say which city you want
weather for and see the
information on your phone
– Example: “What is the weather in San Francisco?”
• Say which bands/CDs you
want to buy and confirm the
choices visually
– Example: “I want to place an order for ‘Hotshot’ by Shaggy.”
Multimodal Interaction
• Multimodal applications
– Voice + Display + Key pad + Stylus etc.
– User is free to switch between voice interaction and use of
display/key pad/clicking/handwriting
• July 2000 Published Multimodal Requirements Draft
• Demonstrations of Multimodal prototypes at Paris face to
face meeting of Voice Browser WG
• Joint W3C/WAP Forum workshop on Multimodal – Hong
Kong September 2000
• February 2001 – W3C publishes Multimodal Request for
Proposals
• Plan to set up Multimodal Working Group later this year
assuming we get appropriate submission(s)
Multimodal Interaction
• Primary market is mobile wireless
– cell phones, personal digital assistants and cars
• Timescale is driven by deployment of 3G networks
• Input modes:
– speech, keypads, pointing devices, and electronic ink
• Output modes:
– speech, audio, and bitmapped or character cell displays
• Architecture should allow for both local and
remote speech processing
Some Ideas …
W3C is seeking detailed proposals with broad industry
support as basis for chartering multimodal working group
• Speech enabling XHTML (and WML) without requiring changes to
markup language
– New ECMAScript Speech Object?
• Loose coupling of VoiceXML with externally defined pages written
in XHTML, SMIL, etc.
– Turn-driven synchronization protocol based on SIP?
• Distributed Speech Processing
– Reduce load on wireless network and speech servers
– Increase recognition accuracy in presence of noise
– ETSI work on Aurora
• Using pen-based gestures to constrain ASR (click and speak)
VoiceXML IP Issues
• Technical work on VoiceXML 2.0 is proceeding well
• Publication of VoiceXML 2.0 working draft held up over IP issues
(although internal version is accessible to W3C Members)
• Related specifications for grammar, speech synthesis, natural
language semantics, lexicon, and call control have been or shortly will be
published.
• W3C and VoiceXML Forum Management are in the process of
developing a formal Memorandum of Understanding
• W3C is convening a Patent Advisory Group to recommend IP Policy
for re-chartering the Voice Browser Activity
– Draw inspiration from IETF, ECTF, ETSI and other bodies, e.g. require
all WG members to license essential IP under openly specified RAND
terms, with operational criteria for effective terms expressed as
exit criteria for the Candidate Recommendation phase. No requirement for
advance disclosure of IP
Discussion?