VISIONS, TECHNOLOGY, AND BUSINESS OF TALKING MACHINES Roberto Pieraccini, CTO, Tell-Eureka Corporation 535 West 34th Street New York, NY 10001 +1 646 792 2744 roberto@telleureka.com http://www.telleureka.com The vision Recreating the Speech Chain DIALOG SEMANTICS SPOKEN LANGUAGE UNDERSTANDING SPEECH RECOGNITION SPEECH SYNTHESIS DIALOG MANAGEMENT SYNTAX LEXICON MORPHOLOG Y PHONETICS INNER EAR ACOUSTIC NERVE VOCAL-TRACT ARTICULATORS The technology Talking Machines: First Steps into Spoken Language Technology Homer Dudley Bell Labs (1939) Von Kempelen (1791) Joseph Faber (1835) Speech Recognition: the Early Years 1952 – Automatic Digit Recognition (AUDREY) Davis, Biddulph, Balashek (Bell Laboratories) 1960’s – Speech Processing and Digital Computers AD/DA converters and digital computers start appearing in the labs James Flanagan Bell Laboratories The Illusion of Segmentation... or... Why Speech Recognition is so Difficult (user:Roberto (attribute:telephone-num value:7360474)) VP NP NP MY IS NUMBER m I n & m &r i b THREE SEVEN SEVEN ZERO NINE s e v & nth rE n I n zE o r TWO t ü FOUR s ev & n f O r The Illusion of Segmentation... or... Ellipses and Anaphors Why Speech Recognition is so Difficult Limited vocabulary Multiple Interpretations Speaker Dependency (user:Roberto (attribute:telephone-num value:7360474)) Word variations VP ruleserrors NP MY IS rules NUMBER rules m I n & m &r i rules b THREE SEVEN ZERO NINE errors errors s e v & nth errors rE n NP Word confusability Context-dependency SEVEN TWO Coarticulation FOUR Noise/reverberation I n z E o Intra-speaker t ü s e v &variability f O r n r 1969 – Whither Speech Recognition? […] General purpose speech recognition seems far away. Social-purpose speech recognition is severely limited. It would seem appropriate for people to ask themselves why they are working in the field and what they can expect to accomplish. […] It would be too simple to say that work in speech recognition is carried out simply because one can get money for it. That is a necessary but no sufficient condition. We are safe in asserting that speech recognition is attractive to money. The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon. One doesn’t attract thoughtlessly given dollars by means of schemes for cutting the cost of soap by 10%. To sell suckers, one uses deceit and offers glamour. […] Most recognizers behave, not like scientists, but like mad inventors or untrustworthy engineers. The typical recognizer gets it into his head that he can solve “the problem.” The basis for this is either individual inspiration (the “mad inventor” source of knowledge) or acceptance of untested rules, schemes, or information (the untrustworthy engineer approach). The Journal of the Acoustical Society of America, June 1969 J. R. Pierce Executive Director, Bell Laboratories 1971-1976: The ARPA SUR project In spite of the anti-speech recognition campaign headed by the Pierce Commission ARPA launches into a 5 year program on Spoken Understanding Research REQUIREMENTS: 1000 word vocabulary, 90%understanding rate, near real time on a 100 MIPS machine 4 Systems built by the end of the program SDC (24%) LESSON LEARNED: Hand-built knowledge does not scale up Need of a global “optimization” criterion BBN’s HWIM (44%) CMU’s Hearsay II (74%) CMU’s HARPY (95% -- 80 times real time!) HARPY was based on an engineering approach search on a network representing all the possible utterances Lack of a scientific evaluation approach Speech Understanding: too early for its time The project was not extended. Raj Reddy -- CMU Vintage Speech Recognition 1970’s – Dynamic Time Warping The Brute Force of the Engineering Approach TEMPLATE (WORD 7) T.K. Vyntsyuk (1969) H. Sakoe, S. Chiba (1970) Isolated Words Speaker Dependent Connected Words Speaker Independent Sub-Word Units UNKNOWN WORD 1980s -- The Statistical Approach Based on work on Hidden Markov Models done by Leonard Baum at IDA, Princeton in the late 1960s Purely statistical approach pursued by Fred Jelinek and Jim Baker at IBM T.J.Watson Research Foundations of modern speech recognition engines Wˆ arg max P( A | W ) P(W ) Fred Jelinek W Acoustic HMMs a11 a22 Word Tri-grams a33 P( wt | wt 1 , wt 2 ) Jim Baker S1 a12 S2 a23 S3 No Data Like More Data Whenever I fire a linguist, our system performance improves (1988) Some of my best friends are linguists (2004) 1980-1990 – The statistical approach becomes ubiquitous Lawrence Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceeding of the IEEE, Vol. 77, No. 2, February 1989. 1980s-1990s – The Power of Evaluation 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 HOSTING MIT SPEECHWORKS SPOKEN STANDARDS DIALOG INDUSTRY APPLICATION DEVELOPERS TOOLS NUANCE SRI Pros and Cons of DARPA programs STANDARDS PLATFORM INTEGRATORS STANDARDS VENDORS + Continuous incremental improvement - Loss of “bio-diversity” TECHNOLOGY The business of speech Voice User Interface (VUI) Design—the Quantum Leap in Dialog Systems 1995 -- The WildFire Effect Change of perspective: From technology driven to user centered RESEARCH: Natural Language free form Commercial: Task completion and usability. Persona: the personality of the application (TTS vs. Recording) Speech recognition accuracy is important, but success is determined by the VUI. The importance of a repeatable, streamlined, teachable, development process The Speech Application Lifecycle Speech Scientist VUI Designer usability 8 speech science Analyst VUI Designer full deployment 7 2 1 Project Manager 3 VUI design 6 9 10 VUI development requirements 4 high level system design partial deployment 5 system engineering integration Architect, App Developer Engineer Voice User Interface Design Get Amount Interaction Module PROMPTS Type Enter Initial Transfer Get Origin Account Get Destination Account Retry 1 Get Amount Play Wrong Amount Message YES amount > origin Retry 2 account? NO Timeout 1 Play Confirmation confirmed? NO Timeout YES 2 Wording Source Please say the amount you would like to transfer from your get_amount_I_1.wav <origin-account> TTS to your get_amount_I_2.wav <destination-account> TTS in dollars origin and cents. get_amount_I_3.wav account Please say the amount you would like to transfer from your get_amount_I_1.wav <origin-account> TTS to your destination account get_amount_I_2.wav <destination-account> TTS in dollars and cents. get_amount_I_3.wav Please say the amount you would like to have transferred, like one hundred dollars and fifty cents. get_amount_R_2_1.wav amountI didn't hear you. I'm sorry, get_amount_T_1_1.wav Please say the amount you would like to transfer from your get_amount_I_1.wav <origin-account> TTS to your get_amount_I_2.wav What is wrong? <destination-account> TTS I didn't hear you this time either. Please say the amount you would like to have transferred, like one hundred dollars and fifty cents. get_amount_T_2_1.wav Go to Main Menu Please say how much do you wish to transfer. You can say the amount in Help dollars and cents, like, for instance, one hundred dollars and fifty cents. get_amount_H.wav ACTIONS CONDITION ACTION Go to "Play Wrong Amount Speech Science: Tuning for performance Accept Correct acceptance Confirm Correct confirmation Accept False acceptance - in Confirm False confirmation Correctly Recognize In Vocabulary Misrecognize Falsely Reject Recognition Out of Vocabulary Correctly Reject Falsely Accept False rejection Correct rejection False acceptance - out Speech Science: Tuning for performance DM utt# sub-err% fa-err% fr-err% rej% OOV% fa-oov% WaitPowerBothUp-2 17 5.88 0 0 5.88 5.88 0 WaitHowMuchSnow 17 5.88 11.76 5.88 23.53 29.41 40 MissingOneChannel 22 4.55 0 0 9.09 9.09 0 WPAllChannels 23 4.35 0 4.35 8.7 4.35 0 PictureBack 27 3.7 3.7 3.7 7.41 7.41 50 WaitFindInputSource 29 3.45 0 0 13.79 13.79 0 PictureProb 33 3.03 12.12 0 0 12.12 100 Utt# = Number of utterances Sub-err% = percent of in-voc utterances wrongly recognized Fa-err% = percent of utterances wrongly accepted Fr-err% = percent of utterances wrongly rejected Rej% = total percent of all utterances rejected OOV% = percent of out-voc utterances Fa-oov% = percent of out-voc utterances wrongly accepted - Prioritize grammars that need improvement Use transcriptions to improve grammars The Architectural Evolution of Spoken Dialog 1994 1998 Native Code 2000 Proprietary IVR Systems 2005 Standard Clients (VoiceXML) Standard Application servers The Voice Web SCXML? EMMA? Voice Browser Internet Web Server MRCP ASR TTS VoiceXML /SALT Telephony Platform SSML, SRGF Telephone CCXML The Evolution of the Interface and the Research-Industry Chasm Natural Language Spoken dialog as an anthropomorphic system Research Systems a-la DARPA Communicator Spoken dialog as a tool SLU: Statistical Language Understanding Large Vocabulary, Dialog Modules Directed Dialog Small Vocabulary Menu Based 1994 1996 1998 2000 2002 2004 2006 The evolution of the market and the industry HOSTING APPLICATION DEVELOPERS PROFESSIONAL SERVICES TOOLS – AUTHORING, TUNING, PREPACKAGED APPLICATIONS PLATFORM INTEGRATORS IVR, VoiceXML, CTI,… TECHNOLOGY VENDORS SPEECH RECOGNITION, TTS 600 to 1,000M$ revenue > 8000 apps worldwide New evolving standards guarantee interoperability of engines and platforms. Third generation dialog systems 1st Generation INFORMATIONAL 2nd Generation TRANSACTIONAL BANKING PACKAGE TRACKING FLIGHT STATUS LOW 3RD Generation PROBLEM SOLVING CUSTOMER CARE STOCK TRADING TECHNICAL SUPPORT FLIGHT/TRAIN RESERVATION MEDIUM COMPLEXITY HIGH 2005 -- Spoken Dialog goes to Saturday Night Live