VISIONS, TECHNOLOGY, AND BUSINESS OF TALKING MACHINES Roberto Pieraccini, CTO, Tell-Eureka Corporation

535 West 34th Street
New York, NY 10001
+1 646 792 2744
roberto@telleureka.com
http://www.telleureka.com
The vision
Recreating the Speech Chain
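[Figure: the speech chain. Linguistic levels: DIALOG, SEMANTICS, SYNTAX, LEXICON, MORPHOLOGY, PHONETICS. Physiological levels: INNER EAR, ACOUSTIC NERVE, VOCAL-TRACT ARTICULATORS. Machine counterparts: SPEECH RECOGNITION, SPOKEN LANGUAGE UNDERSTANDING, DIALOG MANAGEMENT, SPEECH SYNTHESIS.]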
The technology
Talking Machines: First Steps into Spoken Language Technology
Von Kempelen (1791)
Joseph Faber (1835)
Homer Dudley, Bell Labs (1939)
Speech Recognition: the Early Years
• 1952 – Automatic Digit Recognition (AUDREY): Davis, Biddulph, Balashek (Bell Laboratories)
• 1960’s – Speech Processing and Digital Computers
• AD/DA converters and digital computers start appearing in the labs
James Flanagan, Bell Laboratories
The Illusion of Segmentation... or... Why Speech Recognition is so Difficult
[Figure: one utterance shown at several levels: the semantic frame (user:Roberto (attribute:telephone-num value:7360474)), a syntactic parse (NP, VP), the word level (MY, NUMBER, IS, and digit words), and the phonetic transcription underneath.]
Sources of difficulty marked on the same figure:
• Ellipses and Anaphors
• Limited vocabulary
• Multiple Interpretations
• Speaker Dependency
• Word variations
• Word confusability
• Context-dependency
• Coarticulation
• Noise/reverberation
• Intra-speaker variability
The annotations "rules" and "errors" also appear at several levels of the figure.
1969 – Whither Speech Recognition?
[…] General purpose speech recognition seems far away. Special-purpose
speech recognition is severely limited. It would seem appropriate for
people to ask themselves why they are working in the field and what
they can expect to accomplish.
[…] It would be too simple to say that work in speech recognition is
carried out simply because one can get money for it. That is a
necessary but not sufficient condition. We are safe in asserting that
speech recognition is attractive to money. The attraction is perhaps
similar to the attraction of schemes for turning water into gasoline,
extracting gold from the sea, curing cancer, or going to the moon. One
doesn’t attract thoughtlessly given dollars by means of schemes for
cutting the cost of soap by 10%. To sell suckers, one uses deceit and
offers glamour.
[…] Most recognizers behave, not like scientists, but like mad inventors or
untrustworthy engineers. The typical recognizer gets it into his head
that he can solve “the problem.” The basis for this is either individual
inspiration (the “mad inventor” source of knowledge) or acceptance of
untested rules, schemes, or information (the untrustworthy engineer
approach).
J. R. Pierce, Executive Director, Bell Laboratories
The Journal of the Acoustical Society of America, June 1969
1971-1976: The ARPA SUR project
• In spite of the anti-speech-recognition campaign headed by the Pierce Commission, ARPA launches a 5-year program on Speech Understanding Research
• REQUIREMENTS: 1000-word vocabulary, 90% understanding rate, near real time on a 100 MIPS machine
• 4 systems built by the end of the program:
  SDC (24%)
  BBN’s HWIM (44%)
  CMU’s Hearsay II (74%)
  CMU’s HARPY (95% -- 80 times real time!)
• HARPY was based on an engineering approach: search on a network representing all the possible utterances
• Lack of a scientific evaluation approach
• Speech Understanding: too early for its time
The project was not extended.
LESSON LEARNED:
• Hand-built knowledge does not scale up
• Need of a global “optimization” criterion
Raj Reddy -- CMU
Vintage Speech Recognition
1970’s – Dynamic Time Warping
The Brute Force of the Engineering Approach
[Figure: alignment grid matching an UNKNOWN WORD against a stored TEMPLATE (WORD 7)]
T.K. Vyntsyuk (1969)
H. Sakoe, S. Chiba (1970)
• Isolated Words
• Speaker Dependent
• Connected Words
• Speaker Independent
• Sub-Word Units
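Template matching with dynamic time warping can be sketched in a few lines. This is a minimal illustration, not the formulation of any of the cited papers: the feature sequences are plain numbers (a real recognizer would compare frames of spectral features), the local distance is an absolute difference, and the names `dtw_distance`, `recognize`, and `TEMPLATES` are made up for the example.

```python
import math

def dtw_distance(template, unknown, dist=lambda a, b: abs(a - b)):
    """Cumulative cost of the cheapest time alignment between two sequences."""
    n, m = len(template), len(unknown)
    # cost[i][j] = best cost of aligning template[:i] with unknown[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(template[i - 1], unknown[j - 1])
            # Allowed moves: stretch the template, stretch the input, or advance both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(templates, unknown):
    """Return the label of the stored template closest to the unknown word."""
    return min(templates, key=lambda word: dtw_distance(templates[word], unknown))

# Toy one-dimensional "feature" sequences (hypothetical data).
TEMPLATES = {
    "seven": [1, 3, 4, 3, 1],
    "two": [5, 5, 2, 0],
}
SLOW_SEVEN = [1, 1, 3, 4, 4, 3, 1]  # same shape as "seven", spoken more slowly

if __name__ == "__main__":
    print(recognize(TEMPLATES, SLOW_SEVEN))
```

The brute-force character is visible in the nested loops: every unknown word is compared against every frame of every template, which is why the approach started with isolated words from a single speaker.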
1980s -- The Statistical Approach
• Based on work on Hidden Markov Models done by Leonard Baum at IDA, Princeton in the late 1960s
• Purely statistical approach pursued by Fred Jelinek and Jim Baker at IBM T.J. Watson Research
• Foundations of modern speech recognition engines
$$\hat{W} = \arg\max_{W} P(A \mid W)\, P(W)$$
Acoustic HMMs: [Figure: three-state left-to-right HMM with states S1, S2, S3, self-loop probabilities a11, a22, a33, and forward transitions a12, a23]
Word Tri-grams: $P(w_t \mid w_{t-1}, w_{t-2})$
[Photos: Fred Jelinek, Jim Baker]
• No Data Like More Data
• Whenever I fire a linguist, our system performance improves (1988)
• Some of my best friends are linguists (2004)
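The decision rule on this slide, choosing the word sequence W that maximizes P(A | W) P(W), can be illustrated with a toy trigram language model. The sketch below assumes that the acoustic log-likelihoods log P(A | W) are already available for a short list of candidate sentences. The corpus, the candidates, the add-one smoothing, and all function names are assumptions made for the example, not details from the talk.

```python
import math
from collections import Counter

def pad(words):
    """Add sentence-boundary markers so every word has a two-word history."""
    return ["<s>", "<s>"] + list(words) + ["</s>"]

def train_trigrams(sentences):
    """Count trigrams and their two-word contexts in a list of sentences."""
    tri, ctx = Counter(), Counter()
    for words in sentences:
        p = pad(words)
        for t in range(2, len(p)):
            ctx[(p[t - 2], p[t - 1])] += 1
            tri[(p[t - 2], p[t - 1], p[t])] += 1
    return tri, ctx

def log_p_words(words, tri, ctx, vocab_size):
    """log P(W) as a sum of log P(w_t | w_{t-1}, w_{t-2}), with add-one smoothing."""
    p = pad(words)
    total = 0.0
    for t in range(2, len(p)):
        num = tri[(p[t - 2], p[t - 1], p[t])] + 1
        den = ctx[(p[t - 2], p[t - 1])] + vocab_size
        total += math.log(num / den)
    return total

def decode(candidates, tri, ctx, vocab_size):
    """Pick the candidate maximizing log P(A | W) + log P(W)."""
    best = max(candidates, key=lambda c: c[1] + log_p_words(c[0], tri, ctx, vocab_size))
    return best[0]

# Hypothetical training text and candidate list with made-up acoustic scores.
CORPUS = [["my", "number", "is", "seven"], ["my", "number", "is", "two"]]
TRI, CTX = train_trigrams(CORPUS)
VOCAB_SIZE = 7  # my, number, is, his, seven, two, </s>
CANDIDATES = [
    (["my", "number", "is", "seven"], -10.0),
    (["my", "number", "his", "seven"], -9.5),  # slightly better acoustic match
]

if __name__ == "__main__":
    print(" ".join(decode(CANDIDATES, TRI, CTX, VOCAB_SIZE)))
```

In this toy case the second candidate fits the audio slightly better, but the language model has never seen "number his seven", so the combined score favors the first one. More training text sharpens P(W), which is the point behind "No Data Like More Data".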
1980-1990 – The statistical approach becomes ubiquitous
• Lawrence Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
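One of the basic HMM computations covered in that tutorial is the forward algorithm, which evaluates P(A | model) for an observation sequence A. The sketch below applies it to a three-state left-to-right model like the S1, S2, S3 diagram on the earlier slide. The transition and emission values, the two-symbol observation alphabet, and the names are illustrative assumptions. Real acoustic models emit continuous feature vectors and are trained from data.

```python
def forward_likelihood(obs, trans, emit, initial):
    """P(obs | model) for a discrete HMM, summed over all state paths."""
    n = len(trans)
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = [initial[s] * emit[s][obs[0]] for s in range(n)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
            for j in range(n)
        ]
    return sum(alpha)

# Left-to-right topology: each state can only loop or move to the next state.
TRANS = [
    [0.6, 0.4, 0.0],  # a11, a12
    [0.0, 0.7, 0.3],  # a22, a23
    [0.0, 0.0, 1.0],  # a33
]
# Emission probabilities over a toy alphabet of two symbols (0 and 1).
EMIT = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]
INITIAL = [1.0, 0.0, 0.0]  # always start in S1

if __name__ == "__main__":
    print(forward_likelihood([0, 0, 1, 1], TRANS, EMIT, INITIAL))
```

The cost grows linearly with the length of the utterance, not exponentially with the number of state paths, which is part of what made the statistical approach practical.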
1980s-1990s – The Power of Evaluation
[Figure: timeline, 1995 to 2005, of the SPOKEN DIALOG INDUSTRY growing out of research labs (MIT and SPEECHWORKS, SRI and NUANCE), with the layers TECHNOLOGY VENDORS, PLATFORM INTEGRATORS, TOOLS, APPLICATION DEVELOPERS, and HOSTING, and STANDARDS labels]
Pros and Cons of DARPA programs
+ Continuous incremental improvement
- Loss of “bio-diversity”
The business of speech
Voice User Interface (VUI) Design: the Quantum Leap in Dialog Systems
• 1995 -- The WildFire Effect
• Change of perspective: from technology driven to user centered
  Research: natural language, free form
  Commercial: task completion and usability
• Persona: the personality of the application (TTS vs. recording)
• Speech recognition accuracy is important, but success is determined by the VUI.
• The importance of a repeatable, streamlined, teachable development process
The Speech Application Lifecycle
[Figure: the speech application lifecycle as a cycle of ten numbered steps. Phases shown: requirements, high level system design, VUI design, VUI development, system engineering, integration, usability, speech science, partial deployment, full deployment. Roles shown: Project Manager, Analyst, Architect, VUI Designer, App Developer, Engineer, Speech Scientist.]
Voice User Interface Design
Interaction Module: Get Amount
[Figure: call flow with the boxes Enter Transfer, Get Origin Account, Get Destination Account, Get Amount, Play Wrong Amount Message, and Play Confirmation, and the decisions "amount > origin account?" (YES / NO) and "confirmed?" (YES / NO). Callouts on the prompt table mark the origin account, destination account, and amount, and ask "What is wrong?"]
PROMPTS
Type | Wording | Source
Initial | Please say the amount you would like to transfer from your | get_amount_I_1.wav
 | <origin-account> | TTS
 | to your | get_amount_I_2.wav
 | <destination-account> | TTS
 | in dollars and cents. | get_amount_I_3.wav
Retry 1 | Please say the amount you would like to transfer from your | get_amount_I_1.wav
 | <origin-account> | TTS
 | to your | get_amount_I_2.wav
 | <destination-account> | TTS
 | in dollars and cents. | get_amount_I_3.wav
Retry 2 | Please say the amount you would like to have transferred, like one hundred dollars and fifty cents. | get_amount_R_2_1.wav
Timeout 1 | I'm sorry, I didn't hear you. | get_amount_T_1_1.wav
 | Please say the amount you would like to transfer from your | get_amount_I_1.wav
 | <origin-account> | TTS
 | to your | get_amount_I_2.wav
 | <destination-account> | TTS
Timeout 2 | I didn't hear you this time either. Please say the amount you would like to have transferred, like one hundred dollars and fifty cents. | get_amount_T_2_1.wav
Help | Please say how much do you wish to transfer. You can say the amount in dollars and cents, like, for instance, one hundred dollars and fifty cents. | get_amount_H.wav
ACTIONS
CONDITION | ACTION
 | Go to Main Menu
 | Go to "Play Wrong Amount
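The escalation logic behind this prompt table (initial prompt, then Retry 1 and Retry 2 after rejections, Timeout 1 and Timeout 2 after silence, plus Help) can be sketched as a small state machine. This is an illustrative reading of the slide, not the author's implementation: the event names, the two-attempt limit, the behaviour once the limit is exceeded, and the function names are all assumptions.

```python
NO_INPUT, NO_MATCH, HELP = "noinput", "nomatch", "help"

INITIAL = ("Please say the amount you would like to transfer from your "
           "{origin} to your {destination} in dollars and cents.")
EXAMPLE = ("Please say the amount you would like to have transferred, "
           "like one hundred dollars and fifty cents.")
PROMPTS = {
    "initial": INITIAL,
    "retry_1": INITIAL,
    "retry_2": EXAMPLE,
    "timeout_1": ("I'm sorry, I didn't hear you. Please say the amount you would "
                  "like to transfer from your {origin} to your {destination}"),
    "timeout_2": "I didn't hear you this time either. " + EXAMPLE,
    "help": ("Please say how much do you wish to transfer. You can say the amount "
             "in dollars and cents, like, for instance, one hundred dollars and "
             "fifty cents."),
}

def render(name, origin, destination):
    """Fill the TTS slots of a prompt with the account names."""
    return PROMPTS[name].format(origin=origin, destination=destination)

def run_get_amount(events, max_errors=2):
    """Drive the module with simulated recognizer events.

    Each event is NO_INPUT, NO_MATCH, HELP, or a recognized amount.
    Returns (amount or None, names of the prompts played).
    """
    played = ["initial"]
    retries = timeouts = 0
    for event in events:
        if event == HELP:
            played.append("help")
        elif event == NO_INPUT:
            timeouts += 1
            if timeouts > max_errors:
                return None, played  # assumption: the caller decides what happens next
            played.append(f"timeout_{timeouts}")
        elif event == NO_MATCH:
            retries += 1
            if retries > max_errors:
                return None, played  # assumption, as above
            played.append(f"retry_{retries}")
        else:
            return float(event), played
    return None, played

def next_step(amount, origin_balance):
    """The 'amount > origin account?' decision from the call flow."""
    return "Play Wrong Amount Message" if amount > origin_balance else "Play Confirmation"

if __name__ == "__main__":
    amount, played = run_get_amount([NO_INPUT, NO_MATCH, 100.5])
    for name in played:
        print(render(name, "checking account", "savings account"))
    print(next_step(amount, origin_balance=80.0))
```

Keeping the wording in a table separate from the control logic mirrors the slide's layout: a VUI designer can edit the prompts without touching the flow.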
Speech Science: Tuning for performance
Recognition
  In Vocabulary
    Correctly Recognize
      Accept: Correct acceptance
      Confirm: Correct confirmation
    Misrecognize
      Accept: False acceptance - in
      Confirm: False confirmation
    Falsely Reject: False rejection
  Out of Vocabulary
    Correctly Reject: Correct rejection
    Falsely Accept: False acceptance - out
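The outcome categories on this slide can be written as a small classification function. The sketch assumes that the system takes one of three decisions per utterance (accept, confirm, reject) by comparing a recognition confidence score with two thresholds. The threshold values and function names are hypothetical, and the slide does not say how a confirmed out-of-vocabulary utterance is labeled, so the code counts it as a false acceptance.

```python
def decide(confidence, reject_below=0.3, accept_above=0.7):
    """Map a confidence score to a decision (threshold values are illustrative)."""
    if confidence < reject_below:
        return "reject"
    if confidence >= accept_above:
        return "accept"
    return "confirm"

def classify(in_vocabulary, recognized_correctly, decision):
    """Name the outcome of one utterance, following the tree on the slide."""
    if not in_vocabulary:
        # Assumption: anything other than a rejection counts as a false acceptance.
        return "correct rejection" if decision == "reject" else "false acceptance - out"
    if decision == "reject":
        return "false rejection"
    if recognized_correctly:
        return "correct acceptance" if decision == "accept" else "correct confirmation"
    return "false acceptance - in" if decision == "accept" else "false confirmation"

if __name__ == "__main__":
    print(classify(True, False, decide(0.5)))
```

Tuning then amounts to moving the two thresholds: raising them trades false acceptances for more confirmations and false rejections.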
Speech Science: Tuning for performance
DM | utt# | sub-err% | fa-err% | fr-err% | rej% | OOV% | fa-oov%
WaitPowerBothUp-2 | 17 | 5.88 | 0 | 0 | 5.88 | 5.88 | 0
WaitHowMuchSnow | 17 | 5.88 | 11.76 | 5.88 | 23.53 | 29.41 | 40
MissingOneChannel | 22 | 4.55 | 0 | 0 | 9.09 | 9.09 | 0
WPAllChannels | 23 | 4.35 | 0 | 4.35 | 8.7 | 4.35 | 0
PictureBack | 27 | 3.7 | 3.7 | 3.7 | 7.41 | 7.41 | 50
WaitFindInputSource | 29 | 3.45 | 0 | 0 | 13.79 | 13.79 | 0
PictureProb | 33 | 3.03 | 12.12 | 0 | 0 | 12.12 | 100
Utt# = Number of utterances
Sub-err% = percent of in-voc utterances wrongly recognized
Fa-err% = percent of utterances wrongly accepted
Fr-err% = percent of utterances wrongly rejected
Rej% = total percent of all utterances rejected
OOV% = percent of out-voc utterances
Fa-oov% = percent of out-voc utterances wrongly accepted
- Prioritize grammars that need improvement
- Use transcriptions to improve grammars
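The per-module percentages in the table can be recomputed from raw counts, which also shows how the columns relate. The sketch rests on assumptions read off the table's own numbers: every percentage uses the total utterance count as denominator except fa-oov%, which uses the out-of-vocabulary count, fa-err% counts falsely accepted out-of-vocabulary utterances, and rejections are false rejections plus correctly rejected out-of-vocabulary utterances. The raw counts below are back-derived from the published percentages, and the priority score is a made-up heuristic.

```python
def pct(count, total):
    """Percentage rounded to two decimals (0.0 when the denominator is empty)."""
    return round(100.0 * count / total, 2) if total else 0.0

def tuning_row(n_utt, n_sub, n_fa_oov, n_fr, n_oov):
    """Build one table row from raw counts (see the assumptions in the lead-in)."""
    n_rej = n_fr + (n_oov - n_fa_oov)  # false rejections + correctly rejected OOV
    return {
        "utt#": n_utt,
        "sub-err%": pct(n_sub, n_utt),
        "fa-err%": pct(n_fa_oov, n_utt),
        "fr-err%": pct(n_fr, n_utt),
        "rej%": pct(n_rej, n_utt),
        "OOV%": pct(n_oov, n_utt),
        "fa-oov%": pct(n_fa_oov, n_oov),
    }

def prioritize(rows):
    """Order dialog modules by summed error rates, worst first (hypothetical heuristic)."""
    def score(name):
        row = rows[name]
        return row["sub-err%"] + row["fa-err%"] + row["fr-err%"]
    return sorted(rows, key=score, reverse=True)

# Counts back-derived from three rows of the table (an assumption, not source data).
ROWS = {
    "WaitPowerBothUp-2": tuning_row(17, n_sub=1, n_fa_oov=0, n_fr=0, n_oov=1),
    "WaitHowMuchSnow": tuning_row(17, n_sub=1, n_fa_oov=2, n_fr=1, n_oov=5),
    "PictureProb": tuning_row(33, n_sub=1, n_fa_oov=4, n_fr=0, n_oov=4),
}

if __name__ == "__main__":
    for name in prioritize(ROWS):
        print(name, ROWS[name])
```

With these counts the function reproduces the WaitHowMuchSnow and PictureProb rows of the table, and the ranking puts WaitHowMuchSnow first as the grammar most in need of work.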
The Architectural Evolution of Spoken Dialog
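[Figure: timeline with marks at 1994, 1998, 2000, and 2005, and the stages Native Code, Proprietary IVR Systems, Standard Clients (VoiceXML), and Standard Application Servers]
The Voice Web
[Figure: components Telephone, Telephony Platform, Voice Browser, ASR, TTS, Internet, and Web Server. Protocol and language labels: CCXML, MRCP, SSML, SRGF, VoiceXML/SALT, and, with question marks, SCXML? and EMMA?]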
The Evolution of the Interface and the Research-Industry Chasm
[Figure: chart over 1994 to 2006. Research systems a-la DARPA Communicator: natural language, spoken dialog as an anthropomorphic system. Industry: spoken dialog as a tool, from Directed Dialog (small vocabulary, menu based) to SLU: Statistical Language Understanding (large vocabulary, dialog modules).]
The evolution of the market and the industry
• HOSTING
• APPLICATION DEVELOPERS – PROFESSIONAL SERVICES
• TOOLS – AUTHORING, TUNING, PREPACKAGED APPLICATIONS
• PLATFORM INTEGRATORS – IVR, VoiceXML, CTI,…
• TECHNOLOGY VENDORS – SPEECH RECOGNITION, TTS
600 to 1,000M$ revenue
> 8000 apps worldwide
New evolving standards guarantee interoperability of engines and platforms.
Third generation dialog systems
• 1st Generation: INFORMATIONAL (LOW complexity)
• 2nd Generation: TRANSACTIONAL (MEDIUM complexity)
• 3rd Generation: PROBLEM SOLVING (HIGH complexity)
Applications shown: PACKAGE TRACKING, FLIGHT STATUS, BANKING, STOCK TRADING, FLIGHT/TRAIN RESERVATION, CUSTOMER CARE, TECHNICAL SUPPORT
2005 -- Spoken Dialog goes to Saturday Night Live