Lecture # 1 Session 2003 Introduction to Automatic Speech Recognition • Lectures: Jim Glass & guest lecturers • Introduction to ASR – Problem definition – State of the art examples • Course overview – – – – Lecture outline Assignments Term Project Grading 6.345 Automatic Speech Recognition Introduction 1 Communication via Spoken Language Human Input Output Speech Speech Recognition Synthesis Computer Text Text Generation Understanding Meaning Meaning 6.345 Automatic Speech Recognition Introduction 2 Virtues of Spoken Language Natural: Flexible: Efficient: Economical: Requires no special training Leaves hands and eyes free Has high data rate Can be transmitted/received inexpensively Speech Speech interfaces interfaces are are ideal ideal for for information information access access and and management management when: when: •• The The information information space space isis broad broad and and complex, complex, •• The The users users are are technically technically naive, naive, or or •• Only Only telephones telephones are are available. available. 6.345 Automatic Speech Recognition Introduction 3 Diverse Sources of Constraint for Spoken Language Communication Acoustic: human vocal tract Phonetic: let us pray lettuce spray Phonological: gas shortage fish sandwich Phonotactic: blit vnuk Syntactic: I am flying to Chicago tomorrow tomorrow I flying Chicago am to Semantic: Is the baby crying Is the bay bee crying Contextual: It is easy to recognize speech It is easy to wreck a nice beach 6.345 Automatic Speech Recognition Introduction 4 Automatic Speech Recognition ASR ASR System System Speech Signal Recognized Words • An ASR system converts the speech signal into words • The recognized words can be – The final output, or – The input to natural language processing 6.345 Automatic Speech Recognition Introduction 5 Application Areas for Speech Based Interfaces • Mostly input (recognition only) – Simple command and control – Simple data entry (over the phone) – Dictation • Interactive conversation (understanding needed) – Information kiosks – Transactional processing – Intelligent agents 6.345 Automatic Speech Recognition Introduction 6 Basic Speech Recognition Challenges • Co-articulation • Speaker independence – Dialect variations – Non-native speakers • Spontaneous speech – Disfluencies – Out-of-vocabulary words • Language modelling • Noise robustness 6.345 Automatic Speech Recognition Introduction 7 Phonological Variation Example • The acoustic realization of a phoneme depends strongly on the context in which it occurs Frequency Time TEA 6.345 Automatic Speech Recognition TREE STEEP CITY BEATEN Introduction 8 Examples Contrasting Read and Spontaneous Speech (Navigation Domain) Filled and unfilled pauses: Lengthened words: False starts: 6.345 Automatic Speech Recognition read, spontaneous read, spontaneous read, spontaneous Introduction 9 Sometimes Real Data will Dictate Technology Requirements (City Name Domain) Technology Required Simple word spotting Complex word spotting Speech understanding 6.345 Automatic Speech Recognition Example Um, Braintree Eh yes, Avis rent-a-car in Boston Hello, please Brighton, uh, can I have the number of Earthscape, in, uh, on Nonantum Street Woburn, uh, Somerville. I'm sorry Introduction 10 Parameters that Characterize the Capabilities of ASR Systems Parameters Parameters Speaking Speaking Mode: Mode: Speaking Speaking Style: Style: Enrollment: Enrollment: Vocabulary: Vocabulary: Language Language Model: Model: Perplexity: Perplexity: SNR: SNR: Transducer: Transducer: 6.345 Automatic Speech Recognition Range Range Isolated Isolated word word to to continuous continuous speech speech Read Read speech speech to to spontaneous spontaneous speech speech Speaker-dependent Speaker-dependent to to speaker-independent speaker-independent Small Small (<20 (<20 words) words) to to large large (>50,000 (>50,000 words) words) Finite-state Finite-state to to context-sensitive context-sensitive Small Small (<10) (<10) to to large large (>200) (>200) High High (>30dB) (>30dB) to to low low (<10dB) (<10dB) Noise-cancelling Noise-cancelling microphone microphone to to cell cell phone phone Introduction 11 ASR Trends*: Then and Now before before mid mid 70's 70's mid mid 70’s 70’s -- mid mid 80’s 80’s after after mid mid 80’s 80’s Recognition Recognition Units: Units: whole-word whole-word and and sub-word sub-word units units sub-word sub-word units units sub-word sub-word units units Modeling Modeling Approaches: Approaches: heuristic heuristic and and ad ad hoc hoc template template matching matching mathematical mathematical and and formal formal rule-based rule-based and and declarative declarative deterministic deterministic and and data-driven data-driven probabilistic probabilistic and and data-driven data-driven Knowledge Knowledge Representation: Representation: heterogeneous heterogeneous and and complex complex homogeneous homogeneous and and simple simple homogeneous homogeneous and and simple simple Knowledge Knowledge Acquisition: Acquisition: intense intense knowledge knowledge engineering engineering embedded embedded in in simple simple structure structure automatic automatic learning learning * There are, of course, many exceptions. 6.345 Automatic Speech Recognition Introduction 12 Speech Recognition: Where Are We Now? • High performance, speaker-independent speech recognition is now possible – Large vocabulary (for cooperative speakers in benign environments) – Moderate vocabulary (for spontaneous speech over the phone) • Commercial recognition systems are now available – Dictation (e.g., Dragon, IBM, L&H, Philips) Scansoft – Telephone transactions (e.g., AT&T, Nuance, Philips, SpeechWorks, TellMe, etc.) • When well-matched to applications, technology is able to help perform real work 6.345 Automatic Speech Recognition Introduction 13 Examples of ASR Performance 1K, Read 2K, Sponaneous 20K, Read 64K, Broadcast 10K, Conversational 100 Word Error Rate (%) 10 1 2001 1999 1997 1995 1993 1991 1989 0.1 1987 • Speaker-independent, continuousspeech ASR now possible • Digit recognition over the telephone with word error rate of 0.3% • Error rate cut in half every two years for moderate vocabulary tasks • Error for spontaneous speech more than twice that of read speech • Conversational speech, involving multiple speakers and poor acoustic environment, remains a challenge • Tens of hours of training data to port to a different domain • Statistical modeling using automatic training achieves significant advances Digits Year 6.345 Automatic Speech Recognition Introduction 14 Important Lessons Learned • Statistical modeling and data-driven approaches have proved to be powerful • Research infrastructure is crucial: – Large amounts of linguistic data – Evaluation methodologies • Availability and affordability of computing power lead to shorter technology development cycles and real-time systems • Performance-driven paradigm accelerates technology development • Interdisciplinary collaboration produces enhanced capabilities (e.g., spoken language understanding) 6.345 Automatic Speech Recognition Introduction 15 Major Components in a Speech Recognition System Training Data Applying Acoustic Models Constraints Lexical Models Speech Signal Language Models Recognized Words Representation Search • Speech recognition is the problem of deciding on – How to represent the signal – How to model the constraints – How to search for the most optimal answer 6.345 Automatic Speech Recognition Introduction 16 Demo: Continuous Dictation • IBM ViaVoice running on a ThinkPad • Trained for a quiet office (classroom performance not optimal) 6.345 Automatic Speech Recognition Introduction 17 Demo: Simple Telephone Transactions • Developed by SpeechWorks International (there are others) • Shipping cost information for Fedex (1-800-GO-FEDEX) – Provides information on: * Package types * Source and destination zip codes * Weight, size, value * Service type – Handles all US rate information calls • Automated Brokerage System for E*Trade – Supports quotes and trades * Using symbols or names * For stocks, options, and mutual funds – Users can “barge in” at any time – Nationwide deployment for over 450,000 customers 6.345 Automatic Speech Recognition Introduction 18 Conversational Interfaces: The Next Generation • Enables us to converse with machines (in much the same way we communicate with one another) in order to create, access, and manage information and to solve problems • Augments speech recognition technology with natural language technology in order to understand the verbal input • Can engage in a dialogue with a user during the interaction • Uses natural language to speak the desired response • Is what Hollywood and every “futurist” says we should have! 6.345 Automatic Speech Recognition Introduction 19 A Conversational System Architecture SPEECH SYNTHESIS Sentence LANGUAGE GENERATION Speech Graphs & Tables Speech DIALOGUE MANAGEMENT DISCOURSE CONTEXT DATABASE Meaning Representation Meaning SPEECH RECOGNITION 6.345 Automatic Speech Recognition Words LANGUAGE UNDERSTANDING Introduction 20 Demo: Conversational Interface • Jupiter weather information system – Access through telephone – 500 cities worldwide – Harvest weather information from the Web several times daily Jupiter A conversational interface for on-line weather information over the phone. 1-888-573-8255 (outside the USA: 1-617-258-0300) http://www.sls.lcs.mit.edu/jupiter Spoken Language Systems Group, MIT Laboratory for Computer Science 6.345 Automatic Speech Recognition Introduction 21 45 40 35 30 25 20 15 10 5 0 100 Word Data 10 ‘97 Apr ‘98 May Jun Jul Aug Nov Apr ‘99 Nov 1 Training Data (x1000) Error Rate (%) (Real) Data Improves Performance (Weather Domain) May • Longitudinal evaluations show improvements • Collecting real data improves performance: – Enables increased complexity and improved robustness for acoustic and language models – Better match than laboratory recording conditions • Users come in all kinds 6.345 Automatic Speech Recognition Introduction 22 But We Are Far from Done! Corpus Corpus Digit Digit Strings Strings (phone) (phone) Speech Speech Type Type 10 10 0.3 0.3 0.009 0.009 Resource Resource Management Management read read 1000 1000 3.6 3.6 0.1 0.1 ATIS ATIS spontaneous spontaneous 2000 2000 22 ---- Wall Wall Street Street Journal Journal read read 64000 64000 6.6 6.6 11 Radio Radio News News mixed mixed 64000 64000 13.5 13.5 ---- Switchboard Switchboard (phone) (phone) conversation conversation 10000 10000 19.3 19.3 44 Call Call Home Home (phone) (phone) conversation conversation 10000 10000 30 30 ---- 6.345 Automatic Speech Recognition spontaneous spontaneous Lexicon Lexicon Word Word Error Error Human Human Error Error Size Rate Rate Size Rate (%) (%) Rate (%) (%) Introduction 23 Course Outline Paralinguistic Paralinguistic Information Information Speech Understanding Speech Understanding Multi-Modal Multi-Modal Interfaces Interfaces Acoustic Acoustic Theory Theory of of Speech Speech Production Production AcousticAcousticPhonetic Phonetic Modeling Modeling Pattern Pattern Recognition Recognition Finite-State Finite-State Transducers Transducers Language Language Modeling Modeling Robust Robust ASR ASR Acoustic Models Lexical Models Language Models Adaptation Adaptation Speech Signal Properties Properties of of Speech Speech Sounds Sounds Representation Search Signal Signal Representation Representation Search Search Algorithms Algorithms Vector Vector Quantization Quantization & Clustering & Clustering 6.345 Automatic Speech Recognition Recognized Words Hidden Hidden Markov Markov Modeling Modeling Graphical Graphical Models Models Segmental Segmental Models Models Introduction 24 Course Logistics • Lectures:Two sessions/week, 1.5 hours/session • Labs: All week during school hours Grading • 9 Assignments • 2 Quizzes • Term Project (about 4 weeks) 6.345 Automatic Speech Recognition 45% 30% 25% Introduction 25 Assignments • There will be 9 weekly assignments – Problems that expand on the lecture material – Lab assignments to reinforce the lecture material – Assignments are due the following week on Wednesday • Lab work will be done in the computer lab • Lab sign-up (on the course web page) is necessary • Solutions will be provided 6.345 Automatic Speech Recognition Introduction 26 Term Project • Investigate a contrasting condition in an ASR experiment • We will provide different recognizers and domains for you to select from, and will work with you to select a topic • You choose: – – – – Evaluation condition: e.g., phonetic classification, word recognition) Database (e.g., TIMIT, RM, Jupiter, Aurora, …) Recognizer (e.g., Sphinx, Summit, GMTK, …) Contrasting condition (e.g., signal representation, acoustic model, language model) • Requirements: – – – – Proposal Experiments (the bulk of the work) Write-up Presentation on extended last day of class 6.345 Automatic Speech Recognition Introduction 27 References (on reserve at Barker) • Huang, Acero, & Hon, Spoken Language Processing, Prentice-Hall, 2001. • Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1997. • Rabiner & Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1983. • Duda, Hart, & Stork, Pattern Classification, Wiley & Sons, 2001. • Stevens, Acoustic Phonetics, MIT Press, 1998. 6.345 Automatic Speech Recognition Introduction 28