Speech Recognition and Understanding
Alex Acero, Microsoft Research
Thanks to Mazin Rahim (AT&T)

"A Vision into the 21st Century": Milestones in Speech Recognition
• 1962-1967: Small vocabulary, acoustic-phonetics based; isolated words. Filter-bank analysis, time normalization, dynamic programming.
• 1972-1977: Medium vocabulary, template-based; isolated words, connected digits, continuous speech. Pattern recognition, LPC analysis, clustering algorithms, level building.
• 1982-1987: Large vocabulary; syntax, semantics; statistical-based; connected words, continuous speech. Hidden Markov models, stochastic language modeling.
• 1992-1997: Very large vocabulary; semantics, multimodal dialog; continuous speech, speech understanding. Stochastic language understanding, finite-state machines, statistical learning.
• 1997-2003: Spoken dialog; multiple modalities. Concatenative synthesis, machine learning, mixed-initiative dialog.

Multimodal System Technology Components
• Input and output modalities: speech, pen, gesture, visual.
• ASR (Automatic Speech Recognition): speech → words.
• SLU (Spoken Language Understanding): words → meaning.
• DM (Dialog Management): meaning → action, driven by data and rules.
• SLG (Spoken Language Generation): action → words.
• TTS (Text-to-Speech Synthesis): words → speech.

Voice-enabled System Technology Components
• The same ASR → SLU → DM → SLG → TTS pipeline, with speech as the only input and output modality.

Automatic Speech Recognition
• Goal: convert a speech signal into a text message accurately and efficiently, independent of the device, speaker, or environment.
• Applications: accessibility; eyes-busy, hands-busy settings (automobile, doctors, etc.); call centers for customer care; dictation.

Basic Formulation
• The basic equation of speech recognition is
$$\hat{W} = \arg\max_W P(W \mid X) = \arg\max_W \frac{P(W)\,P(X \mid W)}{P(X)} = \arg\max_W P(W)\,P(X \mid W)$$
• $X = X_1, X_2, \ldots, X_n$ is the acoustic observation and $\hat{W} = w_1 w_2 \ldots w_m$ is the word sequence.
• $P(X \mid W)$ is the acoustic model; $P(W)$ is the language model.

Speech Recognition Process
• Input speech → feature extraction → pattern classification (decoding, search) → recognized words with confidence scores, e.g. "Hello World" (0.9) (0.8).
• The decoder draws on three knowledge sources: the acoustic model, the word lexicon, and the language model.

Feature Extraction
• Goal: extract robust features relevant for ASR.
• Method: spectral analysis.
• Result: a feature vector every 10 ms.
• Challenges: robustness to the environment (office, airport, car), devices (speakerphones, cellphones), and speakers (accents, dialect, style, speaking defects).

Spectral Analysis
• Example: female speech (/aa/, pitch of 200 Hz), analyzed over a 30 ms Hamming window.
• Fourier transform: $X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi n k / N}$, where $x[n]$ is the time signal and $X[k]$ its Fourier transform.

Spectrograms
• The short-time Fourier transform reveals the pitch and formant structure.
[Figure: waveform and spectrogram, frequency (Hz) vs. time (seconds).]

Feature Extraction Process
• Pipeline: $s(t)$ → noise removal and normalization → quantization → $s(n)$ → preemphasis → segmentation (blocking) → windowing → spectral analysis → cepstral analysis → filtering → equalization (bias removal or normalization) → temporal derivatives (delta cepstrum, delta² cepstrum) → $c'_m(l)$.
• By-products along the way include energy, zero-crossing rate, pitch, and formants.
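To make the front end concrete, here is a minimal Python/NumPy sketch of the windowing and spectral-analysis stages (30 ms Hamming window, one feature vector every 10 ms). The function name and log-magnitude output are illustrative choices; a full front end would add the preemphasis, cepstral analysis, and delta features shown in the pipeline above.

```python
import numpy as np

def stft_features(x, fs, frame_ms=30, hop_ms=10):
    """Frame a signal with a 30 ms Hamming window every 10 ms and
    return log-magnitude spectra: a minimal stand-in for the
    spectral-analysis stage of the front end."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    win = np.hamming(frame)
    n_frames = 1 + max(0, (len(x) - frame) // hop)
    feats = []
    for m in range(n_frames):
        seg = x[m * hop : m * hop + frame] * win    # windowing
        spec = np.fft.rfft(seg)                     # X[k] = sum x[n] e^{-j2pi nk/N}
        feats.append(np.log(np.abs(spec) + 1e-10))  # log magnitude
    return np.array(feats)  # one feature vector every 10 ms

# Example: a synthetic tone at the slide's 200 Hz pitch, 16 kHz sampling
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)
print(stft_features(x, fs).shape)
```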
Robust Speech Recognition
• A mismatch in the speech signal between the training phase and the testing phase results in performance degradation.
• Compensation can be applied at three points: signal enhancement on the incoming signal, feature normalization, and adaptation of the trained models, to bring training and testing conditions closer together.

Noise and Channel Distortion
• $y(t) = [s(t) + n(t)] * h(t)$: the speech $s(t)$ is corrupted by additive noise $n(t)$ and then convolved with the channel $h(t)$, yielding the distorted speech $y(t)$.
[Figure: Fourier transforms of the same speech at 50 dB and 5 dB SNR.]

Speaker Variations
• Vocal tract length varies from 15 to 20 cm; longer vocal tracts mean lower frequency content.
• Maximum-likelihood speaker normalization warps the frequency axis of the signal, choosing the warping factor $\hat{\alpha} = \arg\max_{\alpha} \Pr(X_{\alpha} \mid \lambda)$, with $\alpha$ typically ranging from 0.87 to 1.17.

Acoustic Modeling
• Goal: map acoustic features into distinct subword units, such as phones, syllables, or words.
• Hidden Markov model (HMM): spectral properties are modeled by a parametric random process; a collection of HMMs is associated with each subword unit; HMMs are also assigned to model extraneous events.
• Advantages: a powerful statistical method for a wide range of data and conditions; highly reliable for recognizing speech.

Discrete-Time Markov Process
• Example: the Dow Jones Industrial Average as a discrete-time first-order Markov chain with three observable states (1 = up, 2 = down, 3 = unchanged) and transition matrix
$$A = \{a_{ij}\} = \begin{pmatrix} 0.6 & 0.2 & 0.2 \\ 0.5 & 0.3 & 0.2 \\ 0.4 & 0.1 & 0.5 \end{pmatrix}$$

Hidden Markov Models
• Now the market state (bull, bear, steady) is hidden, and each hidden state emits up/down/unchanged according to its own output pdf.
• Transition matrix: the same $A$ as above, with rows and columns ordered (bull, bear, steady).
• Output pdfs $\big(P(\text{up}), P(\text{down}), P(\text{unchanged})\big)$: bull $(0.7, 0.1, 0.2)$, bear $(0.1, 0.6, 0.3)$, steady $(0.3, 0.3, 0.4)$.
• Initial state probabilities: $\pi = (0.5, 0.2, 0.3)$.

Example
• We observe (up, up, down, up, unchanged, up). Is it a bull market? A bear market?
$$P(\text{bull}) = 0.7 \cdot 0.7 \cdot 0.1 \cdot 0.7 \cdot 0.2 \cdot 0.7 \cdot [0.5 \cdot 0.6^5] = 1.867 \times 10^{-4}$$
$$P(\text{bear}) = 0.1 \cdot 0.1 \cdot 0.6 \cdot 0.1 \cdot 0.3 \cdot 0.1 \cdot [0.2 \cdot 0.3^5] = 8.748 \times 10^{-9}$$
$$P(\text{steady}) = 0.3 \cdot 0.3 \cdot 0.3 \cdot 0.3 \cdot 0.4 \cdot 0.3 \cdot [0.3 \cdot 0.5^5] = 9.1125 \times 10^{-6}$$
• It is 20 times more likely that we are in a bull market than in a steady market!
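The slide's numbers can be reproduced directly. Here is a small sketch (the variable names are mine) that evaluates the joint probability of the observation sequence and one fixed state path under this Dow Jones HMM:

```python
import numpy as np

# Dow Jones HMM from the slides: hidden states bull, bear, steady;
# observations up, down, unchanged.
states = ["bull", "bear", "steady"]
obs_ids = {"up": 0, "down": 1, "unch": 2}
A = np.array([[0.6, 0.2, 0.2],    # transition probabilities a_ij
              [0.5, 0.3, 0.2],
              [0.4, 0.1, 0.5]])
B = np.array([[0.7, 0.1, 0.2],    # output pdfs b_i(o)
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])
pi = np.array([0.5, 0.2, 0.3])    # initial state probabilities

def joint_prob(state_seq, observations):
    """P(observations, state_seq | model) for one fixed state path."""
    s = [states.index(q) for q in state_seq]
    o = [obs_ids[x] for x in observations]
    p = pi[s[0]] * B[s[0], o[0]]
    for t in range(1, len(o)):
        p *= A[s[t - 1], s[t]] * B[s[t], o[t]]
    return p

obs = ["up", "up", "down", "up", "unch", "up"]
print(joint_prob(["bull"] * 6, obs))    # 1.867e-04
print(joint_prob(["bear"] * 6, obs))    # 8.748e-09
print(joint_prob(["steady"] * 6, obs))  # 9.1125e-06
```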
Example (cont.)
• How about a mixed state path?
$$P(\text{bull, bull, bear, bull, steady, bull}) = (0.7 \cdot 0.7 \cdot 0.6 \cdot 0.7 \cdot 0.4 \cdot 0.7) \cdot (0.5 \cdot 0.6 \cdot 0.2 \cdot 0.5 \cdot 0.2 \cdot 0.4) = 1.382976 \times 10^{-4}$$

Basic Problems in HMMs
Given an acoustic observation $X$ and a model $\lambda$:
• Evaluation: compute $P(X \mid \lambda)$.
• Decoding: choose the optimal state sequence.
• Re-estimation: adjust $\lambda$ to maximize $P(X \mid \lambda)$.

Evaluation: Forward-Backward Algorithm
• Forward variable: $\alpha_t(i) = P(X_1^t, s_t = i \mid \lambda) = \Big[\sum_{j=1}^{N} \alpha_{t-1}(j)\, a_{ji}\Big] b_i(X_t)$.
• Backward variable: $\beta_t(j) = P(X_{t+1}^T \mid s_t = j, \lambda) = \sum_{i=1}^{N} a_{ji}\, b_i(X_{t+1})\, \beta_{t+1}(i)$, for $t = T-1, \ldots, 1$ and $1 \le j \le N$.
• Combining the two:
$$\xi_t(i,j) = P(s_{t-1} = i, s_t = j \mid X_1^T, \lambda) = \frac{P(s_{t-1} = i, s_t = j, X_1^T \mid \lambda)}{P(X_1^T \mid \lambda)} = \frac{\alpha_{t-1}(i)\, a_{ij}\, b_j(X_t)\, \beta_t(j)}{\sum_{k=1}^{N} \alpha_T(k)}$$
for $2 \le t \le T$ and $1 \le i, j \le N$.

Decoding: Viterbi Algorithm
• Step 1, initialization: $V_1(i) = \pi_i\, b_i(x_1)$, $B_1(i) = 0$, for $i = 1, \ldots, N$.
• Step 2, iterations: for $t = 2, \ldots, T$ and $j = 1, \ldots, N$: $V_t(j) = \max_i [V_{t-1}(i)\, a_{ij}]\, b_j(x_t)$ and $B_t(j) = \arg\max_i [V_{t-1}(i)\, a_{ij}]$.
• Step 3, backtracking: the optimal score is $V^* = \max_i V_T(i)$, the final state is $s_T = \arg\max_i V_T(i)$, and the optimal path $(s_1, s_2, \ldots, s_T)$ is recovered with $s_t = B_{t+1}(s_{t+1})$ for $t = T-1, \ldots, 1$.
• The result is the optimal alignment between $X$ and the state sequence $S$ (a code sketch appears at the end of this section).

Reestimation: Baum-Welch Algorithm
• Find $\lambda = (A, B, \pi)$ that maximizes
$$P(X \mid \lambda) = \sum_S P(X, S \mid \lambda) = \sum_S \prod_{t=1}^{T} a_{s_{t-1} s_t}\, b_{s_t}(X_t)$$
• There is no closed-form solution, so we use the EM algorithm: start with the old parameter value $\lambda$ and obtain a new parameter $\hat{\lambda}$ that maximizes the auxiliary function
$$Q(\lambda, \hat{\lambda}) = \sum_{\text{all } S} \frac{P(X, S \mid \lambda)}{P(X \mid \lambda)} \log P(X, S \mid \hat{\lambda}),$$
which decomposes into independent terms $Q_{a_i}(\lambda, \hat{a}_i)$ and $Q_{b_j}(\lambda, \hat{b}_j)$.
• EM guarantees that the likelihood does not decrease.

Continuous Densities
• The output distribution is a mixture of Gaussians:
$$b_j(\mathbf{x}) = \sum_{k=1}^{M} c_{jk}\, N(\mathbf{x}; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk}) = \sum_{k=1}^{M} c_{jk}\, b_{jk}(\mathbf{x})$$
• Posterior probabilities of states $i, j$ at times $t-1, t$, and of mixture $k$ of state $j$ at time $t$:
$$\xi_t(i,j) = \frac{\alpha_{t-1}(i)\, a_{ij}\, b_j(\mathbf{x}_t)\, \beta_t(j)}{\sum_{m=1}^{N} \alpha_T(m)}, \qquad \gamma_t(j,k) = \frac{\Big[\sum_{m=1}^{N} \alpha_{t-1}(m)\, a_{mj}\Big] c_{jk}\, b_{jk}(\mathbf{x}_t)\, \beta_t(j)}{\sum_{m=1}^{N} \alpha_T(m)}$$
• Reestimation formulae:
$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} \xi_t(i,j)}{\sum_{t=1}^{T} \sum_{k=1}^{N} \xi_t(i,k)}, \qquad \hat{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{k'} \gamma_t(j,k')}$$
$$\hat{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, \mathbf{x}_t}{\sum_{t=1}^{T} \gamma_t(j,k)}, \qquad \hat{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, (\mathbf{x}_t - \hat{\boldsymbol{\mu}}_{jk})(\mathbf{x}_t - \hat{\boldsymbol{\mu}}_{jk})^T}{\sum_{t=1}^{T} \gamma_t(j,k)}$$

EM Training Procedure
• Loop: input speech database + old HMM model → estimate posteriors $\gamma_t(j,k)$, $\xi_t(i,j)$ → maximize parameters $a_{ij}$, $c_{jk}$, $\mu_{jk}$, $\Sigma_{jk}$ → updated HMM model.

Design Issues
• Continuous vs. discrete HMM.
• Whole-word vs. subword (phone) units.
• Number of states, number of Gaussians.
• Ergodic vs. Bakis (left-to-right) topology.
• Context-dependent vs. context-independent units.

Training with Continuous Speech
• No segmentation is needed: a composed HMM concatenates the word (or subword) models with silence models according to the transcription, e.g. /sil/ one /sil/ three /sil/.
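Returning to the decoding step above, here is a direct Python transcription of the three Viterbi steps. It is a generic sketch; the commented usage example reuses the Dow Jones matrices A, B, pi from the earlier block.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi decoding: the most likely hidden-state path for an
    observation sequence, following the slides' three steps."""
    N, T = A.shape[0], len(obs)
    V = np.zeros((T, N))              # V_t(j): best score of a path ending in j
    Bp = np.zeros((T, N), dtype=int)  # B_t(j): backpointers
    V[0] = pi * B[:, obs[0]]          # Step 1: initialization
    for t in range(1, T):             # Step 2: iterations
        for j in range(N):
            scores = V[t - 1] * A[:, j]
            Bp[t, j] = np.argmax(scores)
            V[t, j] = scores[Bp[t, j]] * B[j, obs[t]]
    path = [int(np.argmax(V[-1]))]    # Step 3: backtracking
    for t in range(T - 1, 0, -1):
        path.insert(0, Bp[t, path[0]])
    return path, V[-1].max()

# With the Dow Jones HMM defined earlier and the observation
# sequence (up, up, down, up, unchanged, up) encoded as integers:
# path, score = viterbi(A, B, pi, [0, 0, 1, 0, 2, 0])
```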
Context Variability in Speech
• At the word/sentence level: "Mr. Wright should write to Ms. Wright right away about his Ford or four door Honda."
• At the phone level: the same phone /ee/ differs in the words "peat" and "wheel".
• Triphones capture phonetic context and coarticulation.
[Figure: waveforms and spectrograms of "peat" and "wheel", frequency (Hz) vs. time (seconds).]

Context-Dependent Models
• The triphone IY(P, CH) captures coarticulation and phonetic context.
• Stress also matters: Italy vs. Italian.
[Figure: waveforms and spectrograms contrasting the stressed and unstressed vowel.]

Clustering Similar Triphones
• Consider /iy/ with two different left contexts, /r/ and /w/: the two contexts have similar effects on /iy/, so we cluster those triphones together.
[Figure: waveforms and spectrograms of /iy/ in the two left contexts.]

Clustering with Decision Trees
• Example for the word "welcome": a binary tree asks phonetic questions such as "Is the left phone a sonorant or nasal?", "Is the left phone /s, z, sh, zh/?", "Is the right phone a back-R?", "Is the right phone voiced?", and "Is the left phone a back-L, or (is the left phone neither a nasal nor a Y-glide and the right phone a LAX-vowel)?"
• Each leaf of the tree is a tied state called a senone (senones 1-6 in the example).

Other Variability in Speech
• Style: discrete vs. continuous speech; read vs. spontaneous; slow vs. fast.
• Speaker: speaker-independent, speaker-dependent, speaker-adapted.
• Environment: additive noise (the cocktail-party effect), telephone channel.

Acoustic Adaptation
• Model adaptation is needed when test conditions are mismatched, or when we wish to tune the models to a given speaker.
• Maximum a posteriori (MAP) adaptation adds a prior for the parameters: $\hat{\lambda} = \arg\max_{\lambda} [p(\lambda \mid X)] = \arg\max_{\lambda} [p(X \mid \lambda)\, p(\lambda)]$.
• Maximum likelihood linear regression (MLLR) transforms the mean vectors: $\tilde{\boldsymbol{\mu}}_{ik} = W_c\, \boldsymbol{\mu}_{ik}$; there can be more than one MLLR transform.
• Speaker adaptive training (SAT) applies MLLR to the training data as well.

MAP vs. MLLR
• Reference point: a speaker-dependent system trained with 1000 sentences.

Discriminative Training
• Maximum likelihood training obtains the parameters from the true classes only; discriminative training instead maximizes the discrimination between classes.
• Discriminative feature transformation maximizes the ratio of inter-class to intra-class differences, done at the state level, e.g. linear discriminant analysis (LDA).
• Discriminative model training maximizes the posterior probability, using both the correct class and competing classes: maximum mutual information (MMI), minimum classification error (MCE), minimum phone error (MPE).

Word Lexicon
• Goal: map legal phone sequences into words according to phonotactic rules, e.g. David → /d/ /ey/ /v/ /ih/ /d/.
• Multiple pronunciations: several words may have more than one, e.g. Data → /d/ /ae/ /t/ /ax/ or /d/ /ey/ /t/ /ax/.
• Challenges: how do you generate a word lexicon automatically? How do you add new variant dialects and word pronunciations?

The Lexicon
• One entry per word (> 100K words for dictation).
• Multiple pronunciations (e.g. "tomato").
• Built by hand or with letter-to-sound (LTS) rules.
• LTS rules can be trained automatically with decision trees (CART): less than 8% errors, but proper nouns are hard!
[Figure: probabilistic pronunciation network for "tomato", with weighted alternative arcs over the phones /t/, /ax/ or /ow/, /m/, /ey/ or /aa/, /t/ or flapped /dx/, and /ow/.]
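As a toy illustration of the lexicon idea, here is a sketch using the David/Data pronunciations from the slides; the pronunciation weights and the letter-to-sound fallback stub are assumptions for illustration, not a real dictionary.

```python
# A toy pronunciation lexicon: each word maps to one or more
# (phone sequence, probability) entries.
LEXICON = {
    "david": [(("d", "ey", "v", "ih", "d"), 1.0)],
    "data":  [(("d", "ey", "t", "ax"), 0.7),     # weights are invented
              (("d", "ae", "t", "ax"), 0.3)],
}

def pronunciations(word):
    """Return the (phone sequence, probability) entries for a word,
    falling back to a stub where a real system would apply
    letter-to-sound (LTS) rules for out-of-lexicon words."""
    return LEXICON.get(word.lower(), [(("<LTS needed>",), 1.0)])

for word in ["data", "Tomato"]:
    print(word, "->", pronunciations(word))
```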
Language Model
• Goal: model "acceptable" spoken phrases, constrained by the task syntax.
• Rule-based: deterministic, knowledge-driven grammars, e.g. "flying from $city to $city on $date".
• Statistical: compute estimates of word probabilities (N-gram, class-based, CFG), with probabilities attached to alternatives (e.g. 0.6 vs. 0.4), e.g. "flying from Newark to Boston tomorrow".

Formal Grammars
• Rewrite rules:
1. S → NP VP
2. VP → V NP
3. VP → AUX VP
4. NP → ART NP1
5. NP → ADJ NP1
6. NP1 → ADJ NP1
7. NP1 → N
8. NP → NAME
9. NP → PRON
10. NAME → Mary
11. V → loves
12. ADJ → that
13. N → person
• Example parse of "Mary loves that person": S → NP VP, with NP → NAME → Mary and VP → V NP → loves (ADJ NP1) → loves that person.

Chomsky Grammar Hierarchy
• Phrase structure grammar: rules α → β; the most general grammar; recognized by a Turing machine.
• Context-sensitive grammar: the subset with |α| ≤ |β|, where |·| is the length of the string; linear bounded automata.
• Context-free grammar (CFG): A → w and A → BC, where w is a terminal and A, B, C are non-terminals; push-down automata.
• Regular grammar: A → w and A → wB; finite-state automata.

Ngrams
$$P(W) = P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$

Trigram Estimation
$$P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}$$

Understanding Bigrams
• Training data: "John read her book"; "I read a different book"; "John read a book by Mark".
• Bigram estimates:
$$P(\text{John} \mid \langle s \rangle) = \frac{C(\langle s \rangle, \text{John})}{C(\langle s \rangle)} = \frac{2}{3}, \quad P(\text{read} \mid \text{John}) = \frac{C(\text{John}, \text{read})}{C(\text{John})} = \frac{2}{2}, \quad P(\text{a} \mid \text{read}) = \frac{C(\text{read}, \text{a})}{C(\text{read})} = \frac{2}{3},$$
$$P(\text{book} \mid \text{a}) = \frac{C(\text{a}, \text{book})}{C(\text{a})} = \frac{1}{2}, \quad P(\langle /s \rangle \mid \text{book}) = \frac{C(\text{book}, \langle /s \rangle)}{C(\text{book})} = \frac{2}{3}$$
• So $P(\text{John read a book}) = P(\text{John} \mid \langle s \rangle)\, P(\text{read} \mid \text{John})\, P(\text{a} \mid \text{read})\, P(\text{book} \mid \text{a})\, P(\langle /s \rangle \mid \text{book}) \approx 0.148$.
• But we have a problem here: $P(\text{Mark read a book}) = P(\text{Mark} \mid \langle s \rangle)\, P(\text{read} \mid \text{Mark})\, P(\text{a} \mid \text{read})\, P(\text{book} \mid \text{a})\, P(\langle /s \rangle \mid \text{book}) = 0$.

Ngram Smoothing
• Data sparseness: in millions of words, more than 50% of trigrams occur only once.
• We cannot assign $p(w_i \mid w_{i-1}, w_{i-2}) = 0$.
• Solution: assign a non-zero probability to each n-gram by lowering the probability mass of seen n-grams:
$$P_{\text{smooth}}(w_i \mid w_{i-n+1} \ldots w_{i-1}) = \lambda\, P_{\text{ML}}(w_i \mid w_{i-n+1} \ldots w_{i-1}) + (1 - \lambda)\, P_{\text{smooth}}(w_i \mid w_{i-n+2} \ldots w_{i-1})$$

Perplexity
• The cross-entropy of a language model on a word sequence $W$ is $H(W) = -\frac{1}{N_W} \log_2 P(W)$.
• Its perplexity $PP(W) = 2^{H(W)}$ measures the complexity of a language model (the geometric mean of the branching factor).

Perplexity Examples
• The digit recognition task (TIDIGITS) has 10 words, PP = 10, and a 0.2% error rate.
• The Airline Travel Information System (ATIS) has 2000 words, PP = 20, and a 2.5% error rate.
• The Wall Street Journal task has 5000 words, PP = 130 with a bigram, and a 5% error rate.
• In general, lower perplexity means a lower error rate, but perplexity does not take acoustic confusability into account: the E-set (B, C, D, E, G, P, T) has PP = 7 and a 5% error rate.
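Going back to the "Understanding Bigrams" slide, the maximum-likelihood estimates and the zero-probability problem can be reproduced in a few lines:

```python
from collections import Counter

# Toy corpus from the slides
corpus = ["John read her book",
          "I read a different book",
          "John read a book by Mark"]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    words = ["<s>"] + line.split() + ["</s>"]
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_ml(w, prev):
    """Maximum-likelihood bigram estimate C(prev, w) / C(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

def sentence_prob(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= p_ml(w, prev)
    return p

print(sentence_prob("John read a book"))  # ~0.148
print(sentence_prob("Mark read a book"))  # 0.0: the unseen-bigram problem
```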
Ngram Smoothing (cont.)
• The deleted interpolation algorithm estimates the $\lambda$ that maximizes the probability of a held-out data set, e.g. interpolating a bigram with a unigram:
$$P(w_i \mid w_{i-1}) = \lambda\, \frac{C(w_{i-1}, w_i)}{\sum_{w} C(w_{i-1}, w)} + (1 - \lambda)\, \frac{C(w_i)}{\sum_{w} C(w)}$$
• We can also map all out-of-vocabulary words to a distinguished unknown word.
• Other backoff smoothing algorithms are possible: Katz, Kneser-Ney, Good-Turing, class n-grams.

Adaptive Language Models
• Cache language models:
$$P_{\text{cache}}(w_i \mid w_{i-n+1} \ldots w_{i-1}) = \lambda_c\, P_s(w_i \mid w_{i-n+1} \ldots w_{i-1}) + (1 - \lambda_c)\, P_{\text{cache}}(w_i \mid w_{i-2} w_{i-1})$$
• Topic language models.
• Maximum entropy language models.

Bigram Perplexity
[Figure: bigram perplexity vs. vocabulary size (10k to 60k), trained on 500 million words and tested on the Encarta Encyclopedia.]

OOV Rate
[Figure: out-of-vocabulary rate vs. vocabulary size (10k to 60k), measured on the Encarta Encyclopedia; models trained on 500 million words.]

WSJ Results
• Perplexity and word error rate on the 60,000-word Wall Street Journal continuous speech recognition task. Unigrams, bigrams, and trigrams were trained from 260 million words, with Kneser-Ney smoothing:
  Unigram: perplexity 1200, word error rate 14.9%
  Bigram: perplexity 176, word error rate 11.3%
  Trigram: perplexity 91, word error rate 9.6%

Pattern Classification
• Goal: find the "optimal" word sequence by combining the probabilities from the acoustic model, word lexicon, and language model.
• Method: the decoder searches through all possible recognition choices using a Viterbi decoding algorithm.
• Challenge: an efficient search through a large network space is computationally expensive for large-vocabulary ASR.

The Problem of Large Vocabulary ASR
• The basic problem in ASR is to find the sequence of words that explains the input signal, which implies the mapping: features → HMM states → HMM units → phones → words → sentences.
• For the WSJ 20K vocabulary, this results in a network of $10^{22}$ bytes!
• State-of-the-art methods, including fast match, multi-pass decoding, A* stack decoding, and finite state transducers, all provide tremendous speed-ups by searching the network for the best path that maximizes the likelihood function.

Weighted Finite State Transducers (WFST)
• A unified mathematical framework for ASR, efficient in both time and space.
• Composing Word:Phrase, Phone:Word, HMM:Phone, and State:HMM transducers, followed by optimization, yields the search network.
• WFSTs can compile the network down to roughly $10^8$ states: 14 orders of magnitude more efficient.

Confidence Scoring
• Goal: identify possible recognition errors and out-of-vocabulary events; this potentially improves the performance of the ASR, SLU, and DM.
• Method: a confidence score based on a hypothesis likelihood ratio test is associated with each recognized word, from the posterior
$$P(W \mid X) = \frac{P(W)\, P(X \mid W)}{\sum_{W'} P(W')\, P(X \mid W')}$$
• Example: label "credit please", recognized "credit fees", confidences (0.9) (0.3).
• Challenge: rejecting extraneous acoustic events (noise, background speech, door slams) without rejecting valid user input speech.

Speech Recognition Process (recap)
• Input speech → feature extraction → pattern classification (decoding, search, with acoustic model, word lexicon, and language model) → words with confidence scores.

How to Evaluate Performance?
• Dictation applications count insertions, substitutions, and deletions:
$$\text{Word Error Rate} = 100\% \times \frac{\text{Subs} + \text{Dels} + \text{Ins}}{\text{No. of words in the correct sentence}}$$
• Command-and-control applications measure false rejections and false acceptances, summarized in ROC curves.
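Here is a minimal sketch of the word-error-rate computation via edit distance (the standard dynamic-programming recursion; a production scorer would also report the separate substitution, deletion, and insertion counts):

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / number of
    words in the reference, via the edit-distance recursion."""
    r, h = ref.split(), hyp.split()
    # d[i][j]: minimum edits turning the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

print(word_error_rate("credit please", "credit fees"))  # 50.0
```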
ASR Performance: The State of the Art
• DARPA corpora:
  Connected digit strings: read speech, 10-word vocabulary, 0.3% word error rate.
  Airline travel information (ATIS): spontaneous speech, 2,500 words, 2.5%.
  North American Business News: read speech, 64,000 words, 6.6%.
  Broadcast News: news shows, 210,000 words, 13-17%.
  Switchboard: conversational telephone speech, 45,000 words, 25-29%.

Vocabulary Size
[Figure: growth in effective recognition vocabulary size, on a log scale, from 1960 to 2010.]

Improvement in Word Accuracy
• Word accuracy has improved by a 10-20% relative error reduction per year (shown for Switchboard/Call Home, 1996-2002; vocabulary of 40,000 words, perplexity 85).

Human Speech Recognition vs. ASR
• Machines are roughly 10 to 100 times less accurate than humans.
[Figure: machine vs. human error rates for Digits, RM-LM, RM-null, WSJ, WSJ-22dB, NAB-mic, NAB-omni, and SWBD; all points lie between the x10 and x100 lines, well away from the region where machines would outperform humans.]

Challenges in ASR
• System performance: accuracy; efficiency (speed, memory); robustness.
• Operational performance: end-point detection; user barge-in; utterance rejection; confidence scoring.

Demos
• Large-vocabulary ASR demo; multimedia customer care (courtesy of AT&T).

Voice-enabled System Technology Components (recap)
• ASR → SLU → DM → SLG → TTS.

Spoken Language Understanding (SLU)
• Goal: extract and interpret the meaning of the recognized speech in order to identify the user's request.
– Accurate understanding can often be achieved without correctly recognizing every word.
– SLU makes it possible to offer natural-language services where the customer can speak "openly" without learning a specific set of terms.
• Methodology: exploit the task grammar (syntax) and task semantics to restrict the range of meanings associated with the recognized word string.
• Applications: automation of complex operator-based tasks, e.g. customer care, customer help lines, etc.

SLU Formulation
• Let $W$ be a sequence of words and $C$ its underlying meaning (conceptual structure). Using Bayes' rule, $P(C \mid W) = P(W \mid C)\, P(C) / P(W)$.
• Finding the best conceptual structure is done by parsing and ranking, using a combination of acoustic, linguistic, and semantic scores: $C^* = \arg\max_C P(W \mid C)\, P(C)$.

Knowledge Sources for Speech Understanding
• ASR: acoustic/phonetic knowledge (the relationship of speech sounds and English phonemes: the acoustic model) and phonotactic knowledge (rules for phoneme sequences and pronunciation: the word lexicon).
• SLU: syntactic knowledge (the structure of words and phrases in a sentence: the language model) and semantic knowledge (the relationship of meanings among words: the understanding model).
• DM: pragmatic knowledge (discourse, interaction history, and world knowledge: the dialog manager).

SLU Components
• From the ASR (text, lattices, n-best lists, history): text normalization → parsing/decoding → interpretation, with database access, feeding the DM (concepts, entities, parse tree).
• Text normalization handles morphology and synonyms; parsing/decoding extracts named entities, semantic concepts, and the syntactic tree; interpretation performs slot filling, reasoning, and task knowledge representation.

Text Normalization
• Goal: reduce language variation (lexical analysis).
• Morphology: decomposing words into their minimal units of meaning (or grammatical analysis).
• Synonyms: finding words that mean the same ("hello", "hi", "how do you do").
• Equalization: handling disfluencies, non-alphanumeric characters, capitalization, non-white-space characters, etc.
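A minimal sketch of such a normalization pass; the synonym and disfluency tables are toy assumptions, not a real system's lexicon:

```python
import re

# Illustrative tables: toy assumptions, not a deployed system's data
SYNONYMS = {"hi": "hello", "howdy": "hello"}
DISFLUENCIES = {"uh", "um", "er"}

def normalize(utterance):
    """Minimal text normalization: lowercase, strip non-alphanumeric
    characters (equalization), drop disfluencies, map synonyms."""
    text = re.sub(r"[^a-z0-9\s']", " ", utterance.lower())
    words = [SYNONYMS.get(w, w) for w in text.split()
             if w not in DISFLUENCIES]
    return " ".join(words)

print(normalize("Yeah I uh I would like to go from New York to Boston!"))
# 'yeah i i would like to go from new york to boston'
```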
Parsing/Decoding
• Goal: map the textual representation into semantic concepts using knowledge rules and/or data.
• Entity extraction (matching): finite state machines (FSM), context-free grammars (CFG).
• Semantic parsing: a tree-structured meaning representation allowing arbitrary nesting of concepts.
• Decoding: segment an utterance into phrases, each representing a concept (e.g. using HMMs).
• Classification: categorize an utterance into one or more semantic concepts.

Interpretation
• Goal: interpret the user's utterance in the context of the dialog.
• Ellipsis and anaphora: "What about for New York?"
• History mechanism: communicated messages are differential; the history removes ambiguities.
• Semantic frames (slots): associating entities and relations (attribute/value pairs); rule-based.
• Database look-up: retrieving or checking entities.

DARPA Communicator
• DARPA-sponsored research and development of mixed-initiative dialog systems.
• A travel task involving airline, hotel, and car information and reservations.
• Input: "Yeah I uh I would like to go from New York to Boston tomorrow night with United"
• SLU output (concept decoding): Topic: Itinerary; Origin: New York; Destination: Boston; Day of the week: Sunday; Date: May 25th, 2002; Time: >6pm; Airline: United. (A toy slot-filling sketch appears at the end of this section.)

XML Schema
<itinerary>
  <origin>
    <city></city>
    <state></state>
  </origin>
  <destination>
    <city></city>
    <state></state>
  </destination>
  <date></date>
  <time></time>
  <airline></airline>
</itinerary>

Semantic CFG
<rule name="itinerary">
  <o>Show me flights</o>
  <ruleref name="origin"/>
  <ruleref name="destination"/>
</rule>
<rule name="origin">
  from <ruleref name="city"/>
</rule>
<rule name="destination">
  to <ruleref name="city"/>
</rule>
<rule name="city">
  New York | San Francisco | Boston
</rule>

AT&T "How May I Help You?" Customer Care
• The user responds to "How may I help you?" with unconstrained, fluent speech, e.g. "There is a number on my bill I didn't make."
• The system recognizes salient phrases, determines the meaning of the user's speech, and routes the call, e.g. to: unrecognized number, billing credit, account balance, combined bill, or a live agent.
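To make the frame-based SLU output concrete, here is a toy slot-filling sketch in the spirit of the semantic CFG above; the city list and patterns are assumptions for illustration, not the deployed system's grammar:

```python
import re

# Illustrative city inventory (one regex group)
CITIES = r"(New York|Boston|San Francisco|Chicago|Newark)"

def parse_itinerary(utterance):
    """Fill an itinerary frame by matching 'from <city>' and
    'to <city>' phrases: one slot per matched concept."""
    frame = {"topic": "Itinerary", "origin": None, "destination": None}
    m = re.search(r"from " + CITIES, utterance)
    if m:
        frame["origin"] = m.group(1)
    m = re.search(r"to " + CITIES, utterance)
    if m:
        frame["destination"] = m.group(1)
    return frame

print(parse_itinerary(
    "Yeah I uh I would like to go from New York to Boston tomorrow night"))
# {'topic': 'Itinerary', 'origin': 'New York', 'destination': 'Boston'}
```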
Voice-enabled System Technology Components (recap)
• ASR → SLU → DM → SLG → TTS.

Dialog Management (DM)
• Goal: manage elaborate exchanges with the user; provide automated access to information.
• Implementation: a mathematical model programmed from rules and/or data.

Computational Models for DM
• Structural-based approach:
– Assumes that there is a regular structure in dialog that can be modeled as a state transition network or dialog grammars.
– Dialog strategies need to be predefined, which limits flexibility.
– Several real-time deployed systems exist today that have inference engines and knowledge representation.
• Plan-based approach:
– Considers communication as "acting": dialog acts and actions are oriented toward goal achievement.
– Motivated by human/human interaction, in which humans generally have goals and plans when interacting with others.
– Aims to account for general models and theories of discourse.

DM Technology Components
• Context interpretation, dialog strategies (modules), and the backend (database access), sitting between the SLU and the SLG.

Context Interpretation
• User input plus the current context state(t) produce an action and a new context state(t+1).
• A formal representation of the context history is necessary so that the DM can interpret a user's utterance given previously exchanged utterances and identify a new action.
• Natural communication is a differential process.

Dialog Strategies
• Completion (continuation): elicits missing input information from the user.
• Constraining (disambiguation): reduces the scope of the request when multiple pieces of information have been retrieved.
• Relaxation: increases the scope of the request when no information has been retrieved.
• Confirmation: verifies that the system understood correctly; the strategy may differ depending on the SLU confidence measure.
• Reprompting: used when the system expected input but did not receive any, or did not understand what it received.
• Context help: provides the user with help during periods of misunderstanding by either the system or the user.
• Greeting/closing: maintains social protocol at the beginning and end of an interaction.
• Mixed-initiative: allows users to manage the dialog.

Mixed-Initiative Dialog
• Who manages the dialog? At one extreme, system initiative: "Please say collect, calling card, third number." At the other, user initiative: "How can I help you?"

Example: Mixed-Initiative Dialog
• System initiative: "Please say just your departure city." / "Chicago" / "Please say just your arrival city." / "Newark". Long dialogs, but easier to design.
• Mixed initiative: "Please say your departure city." / "I need to travel from Chicago to Newark tomorrow."
• User initiative: "How may I help you?" / "I need to travel from Chicago to Newark tomorrow." Shorter dialogs (a better user experience), but more difficult to design.

Dialog Evaluation
• Why? Identify problematic situations (task failure); minimize user hang-ups and routing to an operator; improve the user experience; perform data analysis and monitoring; conduct data mining for sales and marketing.
• How? Log exchange features from the ASR, SLU, DM, and ANI; define and automatically optimize a task objective function; compute service statistics (instant and delayed).

Optimizing the Task Objective Function
• User satisfaction is the ultimate objective of a dialog system. It depends on: task completion rate; efficiency of the dialog interaction; usability of the system; perceived intelligibility of the system; quality of the audio output; perplexity and quality of the response generation.
• Machine learning techniques can be applied to predict user satisfaction (e.g. the PARADISE framework).

Example of Plan-Based Dialog Systems: TRIPS
[Figure: TRIPS system architecture. Keyboard and speech input pass through a parser to an interpretation manager that maintains the discourse context; a behavioral agent coordinates problem solving and reasoning with a task manager, planner, scheduler, monitors, and events; a generation manager drives the text and display generators, the displays, and speech output.]

Help Desk: Example of Structure-Based Dialog Systems (AT&T Labs Natural Voices™)
• A finite state engine is used to control the action of the interpreter output (see the sketch below).
• The system routes customers to appropriate agents or departments.
• It provides information about products, services, costs, and the business.
• It shows demonstrations of various voice fonts and languages.
• It troubleshoots problems and concerns raised by customers (in progress).
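As a sketch of the finite-state control idea, here is a minimal table-driven dialog manager; the states, events, and prompts are invented for illustration (borrowing the departure/arrival-city example from the mixed-initiative slide), not the AT&T system's actual design.

```python
# State-transition table: (current state, interpreted user event)
# -> (next state, system prompt). Entries illustrate the completion,
# reprompting, and confirmation strategies from the slides.
STRATEGY = {
    ("ask_departure", "city"):    ("ask_arrival", "Please say just your arrival city."),
    ("ask_departure", "silence"): ("ask_departure", "Sorry, please say just your departure city."),
    ("ask_arrival", "city"):      ("confirm", "Did I get both cities right?"),
    ("confirm", "yes"):           ("done", "Thank you."),
    ("confirm", "no"):            ("ask_departure", "Let's start over. Please say your departure city."),
}

def dm_step(state, event):
    """One DM turn; unknown events fall back to context help."""
    return STRATEGY.get((state, event), (state, "Sorry, I did not understand."))

state = "ask_departure"
for event in ["silence", "city", "city", "yes"]:
    state, prompt = dm_step(state, event)
    print(state, "->", prompt)
```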
Voice-enabled System Technology Components (recap)
• ASR (words spoken → words) → SLU (words → meaning) → DM (meaning → action; data, rules) → SLG (action → words to be synthesized) → TTS.

A Good User Interface
• Makes the application easy to use.
• Makes the application robust to the kinds of confusion that arise in human-machine communication by voice.
• Keeps the conversation moving forward, even in periods of great uncertainty on the part of either the user or the machine.
• A great UI cannot save a system with poor ASR and NLU; but the UI can make or break a system, even with excellent ASR and NLU.
• Effective UI design is based on a set of elementary principles: common widgets, sequenced screen presentations, simple error-trap dialogs, and a user manual.

Correction UI
• Present an N-best list so the user can correct recognition errors.

Multimodal System Technology Components (recap)
• The voice pipeline extended with pen, gesture, and visual modalities.

Multimodal Experience
• Access to information anytime and anywhere, using any device.
• The "most appropriate" UI mode, or combination of modes, depends on the device, the task, the environment, and the user's abilities and preferences.

Multimodal Architecture
• Parsing maps the multimodal inputs $x$ to surface semantics $F$ with a language model $P(F \mid x, S_{n-1})$: what the user might do (say).
• Understanding maps $F$ to discourse semantics $S_n$ with a semantic model $P(S_n \mid F, S_{n-1})$: what that means.
• The application logic updates the state, and rendering produces the multimedia outputs $A$ with a behavior model $P(A \mid S_n)$: what to show (say) to the user.

MiPad
• Multimodal Interactive Pad.
• Usability studies show double the throughput for English.
• Speech is mostly useful in cases with lots of alternatives.
• Demos: MiPad video; speech-enabled MapPoint.

Other Research Areas Not Covered
• Speaker recognition (verification and identification).
• Language identification.
• Human/human and human/machine translation.
• Multimedia and document retrieval.
• Speech coding.
• Microphone array processing.
• Multilingual speech recognition.

Application Development: Speech APIs
• Open-standard APIs separate the ASR/TTS engines and platform from the application layer.
• The application is engine-independent and contains the SLU, DM, content server, and host interface.
• The API layer connects the TTS engine to an audio destination object and the ASR engine to an audio source object.

VoiceXML Architecture
• A web server holds the multimedia content (HTML, scripts, VoiceXML, audio, grammars); a VoiceXML gateway runs the voice browser, which renders prompts such as "The temperature is ...".

A VoiceXML Example
<?xml version="1.0"?>
<vxml application="tutorial.vxml" version="1.0">
  <form id="getPhoneNumber">
    <field name="PhoneNumber">
      <prompt>What's your phone number?</prompt>
      <grammar src="../grammars/phone.gram" type="application/x-jsgf" />
      <help>Please say your ten digit phone number.</help>
      <if cond="PhoneNumber &lt; 1000000"> ... </if>
    </field>
  </form>
</vxml>
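The architecture slide implies that VoiceXML pages can be generated dynamically by the web server. Here is a minimal Python sketch of that idea; the temperature value, port, and page layout are hypothetical:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def build_vxml(temperature):
    """Build a tiny VoiceXML page on the fly, in the spirit of the
    'The temperature is ...' prompt on the architecture slide."""
    return f"""<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <block><prompt>The temperature is {temperature} degrees.</prompt></block>
  </form>
</vxml>"""

class VxmlHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = build_vxml(72).encode()  # hypothetical backend lookup
        self.send_response(200)
        self.send_header("Content-Type", "application/voicexml+xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# The VoiceXML gateway would fetch pages from this server:
# HTTPServer(("", 8080), VxmlHandler).serve_forever()
```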
Speech Application Language Tags (SALT)
• SALT adds four tags to HTML/XHTML/cHTML/WML: <listen>, <prompt>, <dtmf>, <smex>.
<html>
  <form action="nextpage.html">
    <input type="text" id="txtBoxCity" />
  </form>
  <listen id="reco1">
    <grammar src="cities.gram" />
    <bind targetElement="txtBoxCity" value="city" />
  </listen>
</html>

The Speech Advantage!
• Reduce costs: reduce labor expenses while still providing customers an easy-to-use and natural way to access information and services.
• New revenue opportunities: 24x7 high-quality customer care automation; access to information without a keyboard or touch-tone pad.
• Customer retention: stickiness of services; added personalization.

Just the Beginning... Market Opportunities
• Consumer communication: voice portals, desktop dictation, telematic applications, disabilities.
• Call center automation: help lines, customer care.
• Voice-assisted e-commerce: B2B.
• Enterprise communication: unified messaging, enterprise sales.

A Look into the Future
• 1990+: Directory assistance, VRCP. Constrained speech, minimal data collection, manual design. Keyword spotting, handcrafted grammars, no dialogue.
• 1995+: Airline reservation, banking. Constrained speech, moderate data collection, some automation. Medium-size ASR, handcrafted grammars, system initiative.
• 2000+: Call centers, e-commerce, MATCH (Multimodal Access To City Help). Spontaneous speech, extensive data collection, semi-automation. Large-size ASR, limited NLU, mixed initiative.
• 2005+: Help desks, e-commerce. Spontaneous speech and pen, fully automated systems. Unlimited ASR, deeper NLU, adaptive systems; multimodal, multilingual.
• Spoken dialog interfaces vs. touch-tone IVR?

Reading Materials
• Spoken Language Processing, X. Huang, A. Acero, H.-W. Hon, Prentice Hall, 2001.
• Fundamentals of Speech Recognition, L. Rabiner and B.-H. Juang, Prentice Hall, 1993.
• Automatic Speech and Speaker Recognition, C.-H. Lee, F. Soong, K. Paliwal, Kluwer Academic Publishers, 1996.
• Spoken Dialogues with Computers, ed. R. De Mori, Academic Press, 1998.