Seminar Speech Recognition 2003
E.M. Bakker
LIACS Media Lab, Leiden University

Outline
• Introduction and State of the Art
• A Speech Recognition Architecture
  – Acoustic modeling
  – Language modeling
  – Practical issues
• Applications

NB: Some of the slides are adapted from the presentation “Can Advances in Speech Recognition make Spoken Language as Convenient and as Accessible as Online Text?”, an excellent presentation by Dr. Patti Price (Speech Technology Consulting, Menlo Park, California 94025) and Dr. Joseph Picone (Institute for Signal and Information Processing, Dept. of Elect. and Comp. Eng., Mississippi State University).

Research Areas
• Speech Analysis (Production, Perception, Parameter Estimation)
• Speech Coding/Compression
• Speech Synthesis (TTS)
• Speaker Identification/Recognition/Verification (Sprint, TI)
• Language Identification (Transparent Dialogue)
• Speech Recognition (Dragon, IBM, ATT)

Speech recognition sub-categories:
• Discrete/Connected/Continuous Speech/Word Spotting
• Speaker Dependent/Independent
• Small/Medium/Large/Unlimited Vocabulary
• Speaker-Independent Large Vocabulary Continuous Speech Recognition (or LVCSR for short :)

Introduction: What is Speech Recognition?
Goal: automatically extract the string of words spoken from the speech signal.

[Diagram: Speech Signal → Speech Recognition → Words (“How are you?”)]

• Other interesting areas:
  – Who the talker is (speaker recognition, identification)
  – Speech output (speech synthesis)
  – What the words mean (speech understanding, semantics)

Introduction: Applications
• Command and control
  – Manufacturing
  – Consumer products
  http://www.speech.philips.com
• Database query
  – Resource management
  – Air travel information
  – Stock quote
  Nuance, American Airlines: 1-800-433-7300, touch 1
• Dictation
  – http://www.lhsl.com/contacts/
  – http://www-4.ibm.com/software/speech
  – http://www.microsoft.com/speech/

Introduction: State of the Art
Speech-recognition software
• IBM (ViaVoice, Voice Server Applications, ...)
  – Speaker independent, continuous command recognition
  – Large vocabulary recognition
  – Text-to-speech confirmation
  – Barge in (the ability to interrupt an audio prompt as it is playing)
• Dragon Systems, Lernout & Hauspie (L&H Voice Xpress™ (:( )
• Philips
  – Dictation
  – Telephone
  – Voice Control (SpeechWave, VoCon SDK, chip-sets)
• Microsoft (Whisper, Dr Who)

Introduction: State of the Art
Speech over the telephone:
• AT&T Bell Labs pioneered the use of speech-recognition systems for telephone transactions.
• Companies such as Nuance, Philips and SpeechWorks have been active in this field for some years now.
• IBM applications over telephone:
  – request news, internet pages, e-mail
  – stock quotes, traveling info
  – weather information

Introduction: State of the Art
Speech over the telephone (Philips):
• SpeechPearl®: large vocabulary natural language recognition (up to 200,000 words)
• SpeechMania®: mixed initiative dialog gives the caller the impression of a truly natural dialogue: full replacement of the human operator.
• SpeechWave™: relatively small vocabularies (up to hundreds of words), available in nearly 40 languages
• Voice ReQuest: the system recognizes the request and routes the call to the appropriate extension, all without the intervention of an operator.

Introduction: State of the Art
Speech over the telephone:
• Brokerage: E*Trade, ...
• Banking
• Travel: United Airlines, Continental Airlines
• BellSouth
• HP
• Federal Express
• Foodline

Introduction: State of the Art
The following is a transcript of a telephone conversation between Victor Zue and the Jupiter system for weather information. Jupiter's toll-free number is 1-888-573-8255:
• [Phone ringing]
• Jupiter: Hi. Welcome to Jupiter, the M.I.T. Lab for Computer Science Weather Information System. This call is being recorded for system development. You may hang up or ask for help at any time. How can I help you?
• V.Z.: What is the forecast for Boston today?
• Jupiter: In Boston this afternoon, partly sunny and windy. High around 50. Tonight, clear, lows around 35. What other information can I give you?
• V.Z.: Will it rain tomorrow?
• Jupiter: To my knowledge, the forecast calls for no rain in Boston tomorrow. Is there anything else?
Introduction: State of the Art (transcript, continued)
• V.Z.: How about Seattle?
• Jupiter: To my knowledge, the forecast calls for no rain in Seattle tomorrow. Is there something else you would like to know?
• V.Z.: Is it sunny anywhere in the Caribbean?
• Jupiter: Sorry, I am not sure what you said.
• V.Z.: Is it sunny anywhere in the Caribbean?
• Jupiter: I am sorry, I may not know the city you are asking about. Try asking for the state or country. For example, what cities do you know about in Massachusetts?
• V.Z.: What is the temperature in Montego Bay?
• Jupiter: In Montego Bay today, high 86 and low 73. Is there something else?
• V.Z.: Good-bye.

Factors that Affect Performance of Speech Recognition Systems
[Figure: four axes, each showing increasing difficulty along a timeline marked 1985, 1995, 2000 and 2005]
• NOISE ENVIRONMENT: quiet room, fixed high-quality mic → normal office, various microphones, telephone → vehicle noise, radio, cell phones → wherever speech occurs
• USER POPULATION: speaker-dependent → speaker independent and adaptive → native speakers, competent foreign speakers, regional accents → all speakers of the language, including foreign
• SPEECH STYLE: careful reading → planned speech → natural human-machine dialog (user can adapt) → all styles, including human-human (unaware)
• COMPLEXITY: application-specific speech and language, expert years to create an app-specific language model → some application-specific data and one engineer year → application independent or adaptive

How Do You Measure the Performance?
USC, October 15, 1999: “the world's first machine system that can recognize spoken words better than humans can.”
“In benchmark testing using just a few spoken words, USC's Berger-Liaw … System not only bested all existing computer speech recognition systems but outperformed the keenest human ears.”
• What benchmarks?
• What was training?
• What was the test?
• Were they independent?
• How large was the vocabulary and the sample size?
• Did they really test all existing systems? Is that different from chance?
• Was the noise added or coincident with speech?
• What kind of noise? Was it independent of the speech?

Evaluation Metrics: Word Error Rate (WER)
[Figure: word error rate (0% to 40%) versus level of difficulty; tasks from hardest to easiest: Conversational Speech, Broadcast News, Read Speech, Continuous Digits, Digits, Letters and Numbers, Command and Control]
• Spontaneous telephone speech is still a “grand challenge”.
• Telephone-quality speech is still central to the problem.
• Broadcast news is a very dynamic domain.

Evaluation Metrics: Human Performance
[Figure: word error rate (0% to 20%) versus speech-to-noise ratio (10 dB, 16 dB, 22 dB, Quiet) on Wall Street Journal with additive noise; curves for Machines and for Human Listeners (Committee)]
• Human performance exceeds machine performance by a factor ranging from 4x to 10x depending on the task.
• On some tasks, such as credit card number recognition, machine performance exceeds human performance because of limits on human memory retrieval capacity.
• The nature of the noise is as important as the SNR (e.g., cellular phones).
• A primary failure mode for humans is inattention.
• A second major failure mode is the lack of familiarity with the domain (i.e., business terms and corporation names).

Evaluation Metrics: Machine Performance
[Figure: word error rate (log scale, 1% to 100%) versus year (1988 to 2003); curves labeled Read Speech (1k, 5k and 20k vocabularies), Varied Microphones, Noisy, Spontaneous Speech, Broadcast Speech, Conversational Speech and (Foreign); annotation “10 X”]
• A Word Error Rate (WER) below 10% is considered acceptable.
• Performance in the field is typically 2x to 4x worse than performance on an evaluation.

What does a speech signal look like?
Spectrogram
[Figure: spectrogram of a speech signal]

Speech Recognition
[Figure]

Recognition Architectures: Why Is Speech Recognition So Difficult?
• Comparison of “aa” in “lOck” vs. “iy” in “bEAt” for conversational speech (SWB)
[Figure: scatter plot of Feature No. 1 versus Feature No. 2 with overlapping regions for phones Ph_1, Ph_2 and Ph_3]
• Measurements of the signal are ambiguous.
• Region of overlap represents classification errors.
• Reduce overlap by introducing acoustic and linguistic context (e.g., context-dependent phones).

Overlap in the cepstral space (alphadigits)
[Figure panels: Female “aa”, Female “iy”, Male “aa”, Male “iy”]

Overlap in the cepstral space (alphadigits)
[Figure panels: Male “aa” (green) vs. Female “aa” (black); Male “iy” (blue) vs. Female “iy” (red)]
• Combined comparisons:
  – Male “aa” (green)
  – Female “aa” (black)
  – Male “iy” (blue)
  – Female “iy” (red)

Overlap in the Cepstral Space (SWB-All)
The following plots demonstrate overlap of recognition features in the cepstral space. These plots consist of all vowels excised from tokens in the SWITCHBOARD conversational speech corpus.
[Figure panels: All Male Vowels, All Female Vowels, All Vowels]

Recognition Architectures: A Communication Theoretic Approach
[Diagram: Message Source → Linguistic Channel → Articulatory Channel → Acoustic Channel; observables: Message, Words, Sounds, Features]
Bayesian formulation for speech recognition:
• P(W|A) = P(A|W) P(W) / P(A)
Objective: minimize the word error rate
Approach: maximize P(W|A) during training
Components:
• P(A|W): acoustic model (hidden Markov models, mixtures)
• P(W): language model (statistical, finite state networks, etc.)
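As a short worked step (standard reasoning, not stated explicitly on the slide): the recognizer picks the word sequence with the highest posterior probability, and because P(A) does not depend on W it can be dropped from the maximization. The symbol \hat{W} for the recognized word sequence is introduced here only for illustration.

```latex
\[
\hat{W}
  = \arg\max_{W} P(W \mid A)
  = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
  = \arg\max_{W} P(A \mid W)\, P(W)
\]
```

This is why the two components can be built separately: the acoustic model supplies P(A|W) and the language model supplies P(W).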
The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).

Recognition Architectures: Incorporating Multiple Knowledge Sources
[Diagram: Input Speech → Acoustic Front-end → Search → Recognized Utterance; Acoustic Models P(A|W) and Language Model P(W) feed the Search]
• The signal is converted to a sequence of feature vectors based on spectral and temporal measurements.
• Acoustic models represent sub-word units, such as phonemes, as a finite-state machine in which states model spectral structure and transitions model temporal structure.
• The language model predicts the next set of words, and controls which models are hypothesized.
• Search is crucial to the system, since many combinations of words must be investigated to find the most probable word sequence.

Acoustic Modeling: Feature Extraction
[Diagram: Input Speech → Fourier Transform → Cepstral Analysis → Perceptual Weighting → Time Derivative → Time Derivative; outputs: Energy + Mel-Spaced Cepstrum, Delta Energy + Delta Cepstrum, Delta-Delta Energy + Delta-Delta Cepstrum]
• Typical: 512 samples at a 16 kHz sampling rate, i.e. a 32 ms window.
• Incorporate knowledge of the nature of speech sounds in measurement of the features.
• Utilize rudimentary models of human perception.
• Use a ~30 msec window for frequency domain analysis.
• Include absolute energy and 12 spectral measurements.
• Time derivatives to model spectral change.

Acoustic Modeling: Hidden Markov Models
• Acoustic models encode the temporal evolution of the features (spectrum).
• Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
• Phonetic model topologies are simple left-to-right structures.
• Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models.
• Sharing model parameters is a common strategy to reduce complexity.

Acoustic Modeling: Parameter Estimation
[Diagram: Initialization → Single Gaussian Estimation → 2-Way Split → Mixture Distribution Reestimation → 4-Way Split → Reestimation → •••]
• Word level transcription
• Supervises a closed-loop data-driven modeling
• Initial parameter estimation
• The expectation/maximization (EM) algorithm is used to improve our parameter estimates.
• Computationally efficient training algorithms (Forward-Backward) are crucial.
• Batch mode parameter updates are typically preferred.
• Decision trees and the use of additional linguistic knowledge are used to optimize parameter-sharing and system complexity.

Language Modeling Is A Lot Like Wheel of Fortune
[Figure]

Language Modeling: N-Grams: The Good, The Bad, and The Ugly
Unigrams (SWB):
• Most Common: “I”, “and”, “the”, “you”, “a”
• Rank-100: “she”, “an”, “going”
• Least Common: “Abraham”, “Alastair”, “Acura”
Bigrams (SWB):
• Most Common: “you know”, “yeah SENT!”, “!SENT um-hum”, “I think”
• Rank-100: “do it”, “that we”, “don’t think”
• Least Common: “raw fish”, “moisture content”, “Reagan Bush”
Trigrams (SWB):
• Most Common: “!SENT um-hum SENT!”, “a lot of”, “I don’t know”
• Rank-100: “it was a”, “you know that”
• Least Common: “you have parents”, “you seen Brooklyn”

Language Modeling: Integration of Natural Language
• Natural language constraints can be easily incorporated.
• Lack of punctuation and search space size pose problems.
• Speech recognition typically produces a word-level time-aligned annotation.
• Time alignments for other levels of information are also available.
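To make the N-gram idea from the language modeling slides concrete, here is a minimal sketch of a bigram model in Python. It is an illustration only: the toy corpus is invented (it is not SWITCHBOARD data), the boundary markers `<s>` and `</s>` stand in for the slides' `!SENT` and `SENT!`, and add-alpha smoothing is an assumed, deliberately simple choice rather than the method used in the systems described here.

```python
from collections import Counter

# Invented toy corpus for illustration; not taken from SWITCHBOARD.
TOY_CORPUS = [
    "you know I think so",
    "you know it was a lot of fun",
    "I think you know that",
]


def train_bigram_model(sentences):
    """Count word histories and bigrams, padding each sentence with boundary markers."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        # Every token except the last one acts as a history for the next word.
        history_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return history_counts, bigram_counts


def vocabulary_size(bigram_counts):
    """Number of distinct tokens seen, including the boundary markers."""
    return len({word for pair in bigram_counts for word in pair})


def bigram_probability(prev_word, word, history_counts, bigram_counts, vocab_size, alpha=1.0):
    """Estimate P(word | prev_word) with add-alpha smoothing (an assumed choice)."""
    numerator = bigram_counts[(prev_word, word)] + alpha
    denominator = history_counts[prev_word] + alpha * vocab_size
    return numerator / denominator


def predict_next(prev_word, bigram_counts, top_k=3):
    """Return up to top_k most frequent successors of prev_word (ties broken alphabetically)."""
    candidates = [(w2, count) for (w1, w2), count in bigram_counts.items() if w1 == prev_word]
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in candidates[:top_k]]
```

With the toy corpus, `predict_next("you", bigram_counts)` proposes "know", which mirrors how a language model narrows the set of words the search has to hypothesize next.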
Implementation Issues: Dynamic Programming-Based Search
• Dynamic programming is used to find the most probable path through the network.
• Beam search is used to control resources.
• Search is time synchronous and left-to-right.
• Arbitrary amounts of silence must be permitted between each word.
• Words are hypothesized many times with different start/stop times, which significantly increases search complexity.

Implementation Issues: Cross-Word Decoding Is Expensive
• Cross-word decoding: since word boundaries don’t occur in spontaneous speech, we must allow for sequences of sounds that span word boundaries.
• Cross-word decoding significantly increases memory requirements.

Implementation Issues: Search Is Resource Intensive
Megabytes of memory:
• Feature Extraction (1M)
• Acoustic Modeling (10M)
• Language Modeling (30M)
• Search (150M)
Percentage of CPU:
• Feature Extraction 1%
• Language Modeling 15%
• Search 25%
• Acoustic Modeling 59%

• Typical LVCSR systems have about 10M free parameters, which makes training a challenge.
• Large speech databases are required (several hundred hours of speech).
• Tying, smoothing, and interpolation are required.

Applications: Conversational Speech
• Conversational speech collected over the telephone contains background noise, music, fluctuations in the speech rate, laughter, partial words, hesitations, mouth noises, etc.
• WER (Word Error Rate) has decreased from 100% to 30% in six years.
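Since WER is the metric quoted throughout these slides, a small sketch of how it is usually computed may help. The assumption here is the common definition: the minimum number of word substitutions, deletions and insertions needed to turn the reference transcript into the recognizer output, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative choices, not part of any particular scoring tool.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two strings, divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from the hypothesis
            insertion = current[j - 1] + 1   # extra word in the hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)
```

For example, scoring the hypothesis "who are you" against the reference "how are you" counts one substitution over three reference words. Because insertions are counted, the ratio can exceed 100% when the hypothesis is much longer than the reference.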
Phenomena found in conversational speech:
• Laughter
• Singing
• Unintelligible
• Spoonerism
• Background Speech
• No pauses
• Restarts
• Vocalized Noise
• Coinage

Applications: Audio Indexing of Broadcast News
Broadcast news offers some unique challenges:
• Lexicon: important information in infrequently occurring words
• Acoustic Modeling: variations in channel, particularly within the same segment (“in the studio” vs. “on location”)
• Language Model: must adapt (“Bush,” “Clinton,” “Bush,” “McCain,” “???”)
• Language: multilingual systems? language-independent acoustic modeling?

Applications: Automatic Phone Centers
• Portals: Bevocal, TellMe, HeyAnita
• VoiceXML 2.0
• Automatic Information Desk
• Reservation Desk
• Automatic Help-Desk
• With speaker identification:
  – bank account services
  – e-mail services
  – corporate services

Applications: Real-Time Translation
• From President Clinton’s State of the Union address (January 27, 2000): “These kinds of innovations are also propelling our remarkable prosperity... Soon researchers will bring us devices that can translate foreign languages as fast as you can talk... molecular computers the size of a tear drop with the power of today’s fastest supercomputers.”
• Imagine a world where:
  – You book a travel reservation from your cellular phone while driving in your car without ever talking to a human (database query)
  – You converse with someone in a foreign country and neither speaker speaks a common language (universal translator)
  – You place a call to your bank to inquire about your bank account and never have to remember a password (transparent telephony)
  – You can ask questions by voice and your Internet browser returns answers to your questions (intelligent query)
• Human Language Engineering: a sophisticated integration of many speech and language related technologies...
a science for the next millennium.

Technology: Future Directions
[Timeline with decade marks 1960, 1970, 1980, 1990 and 2000, showing Analog Filter Banks, Dynamic Time-Warping and Hidden Markov Models in that order]
Conclusions:
• supervised training is a good machine learning technique
• large databases are essential for the development of robust statistics
Challenges:
• discrimination vs. representation
• generalization vs. memorization
• pronunciation modeling
• human-centered language modeling
The algorithmic issues for the next decade:
• Better features by extracting articulatory information?
• Bayesian statistics? Bayesian networks?
• Decision Trees? Information-theoretic measures?
• Nonlinear dynamics? Chaos?