Machine Translation & Automated Speech Recognition
Jaime Carbonell, with Richard Stern and Alex Rudnicky
Language Technologies Institute, Carnegie Mellon University
Telephone: +1 412 268 7279; Fax: +1 412 268 6298; jgc@cs.cmu.edu
For Samsung, February 20, 2008

Outline of Presentation
Machine Translation (MT):
– Three major paradigms for MT (transfer, example-based, statistical)
– New method: Context-Based MT
Automated Speech Recognition (ASR):
– The Sphinx family of recognizers
Environmental Robustness for ASR
Intelligent Tutoring for English:
– Native Accent: pronunciation product
– Grammar, vocabulary & other tutoring
Components for Virtual English Village

Language Technologies
BILL OF RIGHTS              TECHNOLOGY (e.g.)
"…right information"        IR (search engines)
"…right people"             Routing, personalization
"…right time"               Anticipatory analysis
"…right medium"             Speech: ASR, TTS
"…right language"           Machine translation
"…right level of detail"    Summarization, drill-down

Machine Translation: Context is Needed to Resolve Ambiguity
Example: English → Japanese
– Power line – densen (電線)
– Subway line – chikatetsu (地下鉄)
– (Be) on line – onrain (オンライン)
– (Be) on the line – denwachuu (電話中)
– Line up – narabu (並ぶ)
– Line one's pockets – kanemochi ni naru (金持ちになる)
– Line one's jacket – uwagi o nijuu ni suru (上着を二重にする)
– Actor's line – serifu (セリフ)
– Get a line on – joho o eru (情報を得る)
Sometimes local context suffices (as above) . . . but sometimes not.

CONTEXT: More is Better
Examples requiring longer-range context:
– "The line for the new play extended for 3 blocks."
– "The line for the new play was changed by the scriptwriter."
– "The line for the new play got tangled with the other props."
– "The line for the new play better protected the quarterback."
Better MT requires better context sensitivity:
– Syntax-rule-based MT → little or no context
– EBMT → potentially long context, but only on exact substring matches
– SMT → short context, with proper probabilistic optimization
– CBMT → long context with optimization (but still experimental)

Types of Machine Translation
[Diagram: the MT pyramid. Analysis rises from the Source (Arabic) through syntactic parsing and semantic analysis toward an Interlingua; generation descends through sentence planning and text generation to the Target (English). Transfer rules bridge the intermediate levels; direct methods (SMT, EBMT) map across the base.]

Example-Based Machine Translation
Example:
– English: I would like to meet her.
  Mapudungun: Ayükefun trawüael fey engu.
– English: The tallest man is my father.
  Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.
– English: I would like to meet the tallest man.
  Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
  Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.

Statistical Machine Translation: The Three Ingredients
– English language model: p(E)
– English–Korean translation model: p(K|E)
– Decoder: E* = \arg\max_E p(E|K) = \arg\max_E p(E) \, p(K|E)

Alignments
Example alignment:
– English: The proposal will not be implemented now
– French: Les propositions ne seront pas mises en application maintenant
Translation models are built in terms of hidden alignments:
p(F|E) = \sum_A p(F, A | E)
Each English word "generates" zero or more French words.
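Under IBM-Model-1-style assumptions (uniform alignments, word-for-word lexical probabilities) the sum over alignments factorizes per target position. A minimal sketch of that marginalization follows; the t-table values are invented purely for illustration.

```python
# Minimal sketch of the alignment marginalization p(F|E) = sum_A p(F, A|E)
# under IBM-Model-1-style assumptions, where the sum over alignments
# factorizes into a per-position sum. The t-table below is a toy example.

t_table = {  # t_table[f][e] approximates p(f|e); values are made up
    "propositions": {"proposal": 0.7, "NULL": 0.01},
    "maintenant":   {"now": 0.8, "NULL": 0.02},
}

def model1_likelihood(f_words, e_words):
    """p(F|E) ~= prod_j (1/(l+1)) * sum_i t(f_j | e_i), with e_0 = NULL."""
    e_words = ["NULL"] + list(e_words)
    prob = 1.0
    for f in f_words:
        prob *= sum(t_table.get(f, {}).get(e, 1e-9) for e in e_words) / len(e_words)
    return prob

print(model1_likelihood(["propositions", "maintenant"], ["proposal", "now"]))
```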
Conditioning Joint Probabilities
p(F, A | E) = p(m = length(F) | E) \cdot \prod_{j=1}^{m} p(a_j | a_1^{j-1}, f_1^{j-1}, m, E) \, p(f_j | a_1^{j}, f_1^{j-1}, m, E)
The modeling problem is how to approximate these terms to achieve:
– Statistically robust estimates
– Computationally efficient training
– An effective and accurate model

The Gang of Five IBM++ Models
A series of five models is sequentially trained to "bootstrap" a detailed translation model from parallel corpora:
1. Word-by-word translation. Parameters p(f|e) between pairs of words.
2. Local alignments. Adds alignment probabilities p(a_j | j, sentence lengths).
3. Fertilities and distortions. Allow a source word to generate zero or more target words with p(\phi(e) | e), then place words in position with p(j | i, sentence lengths).
4. Tablets. Groups words into phrases.
• Primary idea of Example-Based MT incorporated into SMT
• Currently seeking meaningful (vs. arbitrary) phrases, e.g., NPs, PPs
5. Non-deficient alignments. Don't ask. . .

Multi-Engine Machine Translation
MT systems have different strengths:
– Rapidly adaptable: statistical, example-based
– Good grammar: rule-based (linguistic) MT
– High precision in narrow domains: KBMT
– Minority-language MT: learnable from informant
Combine results of parallel-invoked MT:
– Select best of multiple translations
– Selection based on optimizing a combination of:
» Target-language joint-exponential model
» Confidence scores of the individual MT engines

Illustration of Multi-Engine MT
Source: El punto de descarge se cumplirá en el puente Agua Fria
– Engine 1: The drop-off point will comply with The cold Bridgewater
– Engine 2: The discharge point will self comply in the "Agua Fria" bridge
– Engine 3: Unload of the point will take place at the cold water of bridge

Key Challenges for Corpus-Based MT
Long-range context:
– More is better
– How to incorporate it? How to do so tractably?
– "I conjecture that MT is AI complete" – H. Simon
– State of the art: trigrams → 4- or 5-grams in the LM only (e.g., Google); SMT → SMT+EBMT (phrase tables, etc.)
Parallel corpora:
– More is better
– Where to find enough of it? What about rare languages?
– "There's no data like more data" – IBM SMT, circa 1992
– State of the art: avoids rare languages (GALE only does Arabic & Chinese); symbolic rule-learning from a minimalist parallel corpus (AVENUE project at CMU)

Parallel Text: Requiring Less is Better (Requiring None is Best)
Challenge:
– There is just not enough to approach human-quality MT for most major language pairs (we need ~100X more)
– Much parallel text is not on-point (not on domain)
– Rare languages and some distant pairs have virtually no parallel text
Context-Based (CBMT) Approach:
– Invented and patented by Meaningful Machines (in New York)
– Requires no parallel text, no transfer rules . . .
– Instead, CBMT needs:
» A fully-inflected bilingual dictionary
» A (very large) target-language-only corpus
» A (modest) source-language-only corpus [optional, but preferred]
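Before walking through the CBMT pipeline, a brief sketch of the multi-engine selection step from the earlier slide: a log-linear combination of a target-LM score with per-engine confidences. The toy unigram "LM", the weights, and the confidence values are invented stand-ins, not the actual joint-exponential model.

```python
import math

# Hypothetical log-linear selector over parallel-invoked MT engine outputs.
def select_best(candidates, lm_score, lm_weight=0.7, conf_weight=0.3):
    """candidates: list of (translation, engine_confidence) pairs."""
    def score(cand):
        text, conf = cand
        return lm_weight * lm_score(text) + conf_weight * math.log(conf)
    return max(candidates, key=score)[0]

# Toy unigram "LM" just to make the sketch runnable; not a real model.
def toy_lm_score(text):
    common = {"the", "point", "will", "bridge", "at", "take", "place"}
    words = text.lower().split()
    return sum(0.0 if w in common else -1.0 for w in words) / len(words)

candidates = [
    ("The drop-off point will comply with The cold Bridgewater", 0.4),
    ("The discharge point will self comply in the 'Agua Fria' bridge", 0.5),
    ("Unload of the point will take place at the cold water of bridge", 0.3),
]
print(select_best(candidates, toy_lm_score))
```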
CBMT System
[Diagram: system architecture. A source-language n-gram segmenter/parser feeds the Flooder (the non-parallel-text method), which draws on the inflected bilingual dictionary, target corpora, optional source corpora, and gazetteers. N-gram candidates pass through an Edge Locker and a cross-language n-gram database; stored and approved n-gram pairs feed the overlap-based decoder, which produces the target-language output.]

Step 1: Source Sentence Chunking
– Segment the source sentence into overlapping n-grams via a sliding window
– Typical n-gram length: 4 to 9 terms
– Each term is a word or a known phrase
– Any sentence length (for the BLEU test: average 27, shortest 8, longest 66 words)
Example, for a source sentence S1 . . . S9 with a window of 5:
S1 S2 S3 S4 S5
S2 S3 S4 S5 S6
S3 S4 S5 S6 S7
S4 S5 S6 S7 S8
S5 S6 S7 S8 S9

Step 2: Dictionary Lookup
Using the bilingual dictionary, list all possible target translations for each source word or phrase.
Source word-string: S2 S3 S4 S5 S6
Flooding set (target word lists from the inflected bilingual dictionary):
S2 → T2-a T2-b T2-c T2-d
S3 → T3-a T3-b T3-c
S4 → T4-a T4-b T4-c T4-d T4-e
S5 → T5-a
S6 → T6-a T6-b T6-c

Step 3: Search Target Text
Using the flooding set, search the target text for word-strings containing one word from each group:
– Find the maximum number of words from the flooding set in a minimum-length word-string
– Words or phrases can be in any order
– Ignore function words in the initial step (T5 is a function word in this example)
[Diagrams: three example searches over the target corpus, yielding Target Candidate 1 (containing T3-b, T2-d, T6-c, T4-a, interleaved with other words T(x)), Target Candidate 2 (containing T4-a, T6-b, T2-c, T3-a), and Target Candidate 3 (containing T3-c, T2-b, T4-e, T5-a, T6-a).]
Reintroduce function words after the initial match (e.g., T5).
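A minimal sketch of Steps 1–3 under simplifying assumptions: segment the source into overlapping n-grams, expand each word through a bilingual dictionary into a flooding set, then find the shortest target-corpus window covering every set. The dictionary entries and corpus are hypothetical stand-ins, not the Meaningful Machines implementation.

```python
# Sketch of CBMT chunking and flooding; all data below is invented.

def ngrams(words, n=5):
    """Step 1: overlapping n-grams via a sliding window."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]

def flood(flooding_sets, target_corpus, max_window=8):
    """Step 3: shortest target window containing one alternative per set."""
    best = None
    for i in range(len(target_corpus)):
        for j in range(i + len(flooding_sets),
                       min(i + max_window, len(target_corpus)) + 1):
            window = target_corpus[i:j]
            if all(any(alt in window for alt in s) for s in flooding_sets):
                if best is None or j - i < best[1] - best[0]:
                    best = (i, j)
                break  # longer windows at this start cannot be shorter
    return best

# Step 2 (dictionary lookup) would produce flooding sets like these:
sets = [{"car", "auto"}, {"bomb", "pump"}, {"explodes", "bursts"}]
corpus = "a car bomb explodes next to a police station".split()
span = flood(sets, corpus)
print(corpus[span[0]:span[1]])  # ['car', 'bomb', 'explodes']
```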
Step 4: Score Word-String Candidates
Scoring of candidates is based on:
– Proximity: minimize extraneous words in the target n-gram (≈ precision)
– Number of word matches: maximize coverage (≈ recall)
– Regular words are given more weight than function words
– Combine results (e.g., optimize F1 or a p-norm or . . .)
[Table: the three candidates ranked. T3-c T2-b T4-e T5-a T6-a scores 1st overall (most "regular"-word matches); T6-b T(x) T2-c T3-a is most proximate but has fewer matches, scoring 2nd; T6-c T3-b T(x) T2-d T(x) T(x) T4-a scores 3rd.]

Step 5: Select Candidates Using Overlap
(Propagate context over the entire sentence.)
[Diagram: candidate lists for word-strings 1, 2, and 3. Candidates from adjacent word-strings are chained wherever the tail of one matches the head of the next; two alternative full-sentence paths emerge, e.g., one through T4-a T6-b T(x3) T2-c T3-a T(x8) and one through T3-c T2-b T4-e T5-a T6-a T(x8).]
The best translations are selected via maximal overlap.

A (Simple) Real Example of Overlap
N-grams generated from flooding:
a United States soldier
United States soldier died
soldier died and two others
died and two others were injured
two others were injured Monday
N-grams connected via overlap (n-gram fidelity + overlap → long-range fidelity):
a United States soldier died and two others were injured Monday
Systran, for comparison:
A soldier of the wounded United States died and other two were east Monday

Comparative System Scores
[Bar chart of BLEU scores, the Spanish systems measured on the same Spanish test set; human scoring range ≈ 0.80–0.85: Google Arabic ('05 NIST) 0.3859; Google Chinese ('06 NIST) 0.5137; Systran Spanish 0.5551; SDL Spanish 0.5610; Google Spanish ('08 top language) 0.7209; MM Spanish 0.7447; MM Spanish (non-blind) 0.7533.]

Historical CBMT Scoring
[Chart: blind and non-blind BLEU scores from April 2004 (0.3743) through December 2007 (blind 0.7447; non-blind 0.7533), rising steadily through intermediate scores in the 0.56–0.73 range; human scoring range ≈ 0.80–0.85.]

Illustrative Translation Examples
Un coche bomba estalla junto a una comisaría de policía en Bagdad
– MM: A car bomb explodes next to a police station in Baghdad
– Systran: A car pump explodes next to a police station of police in bagdad
Hamas anunció este jueves el fin de su cese del fuego con Israel
– MM: Hamas announced Thursday the end of the cease fire with israel
– Systran: Hamas announced east Thursday the aim of its cease-fire with Israel
Testigos dijeron haber visto seis jeeps y dos vehículos blindados
– MM: Witnesses said they have seen six jeeps and two armored vehicles
– Systran: Six witnesses said to have seen jeeps and two armored vehicles

A More Complex Example
Un soldado de Estados Unidos murió y otros dos resultaron heridos este lunes por el estallido de un artefacto explosivo improvisado en el centro de Bagdad, dijeron funcionarios militares estadounidenses.
– MM: A United States soldier died and two others were injured monday by the explosion of an improvised explosive device in the heart of Baghdad, American military officials said.
– Systran: A soldier of the wounded United States died and other two were east Monday by the outbreak from an improvised explosive device in the center of Bagdad, said American military civil employees.
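The overlap step in the simple example above can be sketched as greedy suffix/prefix chaining of the n-gram translations. This illustrates the idea only; the actual decoder optimizes over many candidates per position.

```python
# Sketch of Step 5: stitch overlapping n-gram translations into a sentence
# by chaining on maximal suffix/prefix overlap. N-grams are the ones from
# the "simple real example" slide.

def overlap(a, b):
    """Longest k such that a's last k words equal b's first k words."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def stitch(ngram_list):
    ngram_list = [n.split() for n in ngram_list]
    sent = ngram_list[0]
    for ng in ngram_list[1:]:
        sent = sent + ng[overlap(sent, ng):]
    return " ".join(sent)

ngram_candidates = [
    "a United States soldier",
    "United States soldier died",
    "soldier died and two others",
    "died and two others were injured",
    "two others were injured Monday",
]
print(stitch(ngram_candidates))
# -> a United States soldier died and two others were injured Monday
```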
Summing up CBMT
What if a source word or phrase is not in the bilingual dictionary?
– Automatically find near-synonyms in the source
– Replace and retranslate
Current status:
– Developing a Chinese-to-English CBMT system (no results yet)
– Pre-commercial (field-testing)
– CMU participating in improving CBMT technology
Advantages:
– Most accurate MT method in existence
– Does not require parallel text
Disadvantages:
– Run-time speed: 20 s/word (7/07) → 1.5 s/word (2/08) → 0.1 s/word (12/08)
– Requires heavy-duty servers (not yet miniaturizable)

Flat Seed Generation
Training example:
– English: The highly qualified applicant visits the company.
– German: Der äußerst qualifizierte Bewerber besucht die Firma.
– Alignment: ((1,1), (2,2), (3,3), (4,4), (5,5), (6,6))
Seed rule:
S::S [det adv adj n v det n] → [det adv adj n v det n]
((x1::y1) (x2::y2) . . .
((x4 agr) = *3-sing) . . .
((y3 case) = *nom) . . .)

Compositionality
NP::NP [det adv adj n] → [det adv adj n]
((x1::y1) . . . ((y4 agr) = (x4 agr)) . . .)
Adjust the seed rule to reflect compositionality:
S::S [NP v det n] → [NP v det n]
((x1::y1) (x2::y2) . . . ((y1 case) = *nom) . . .)
The NP rule can now be used to translate part of the sentence; keep or add context constraints, and eliminate unnecessary ones.

Goal: Syntactic Transfer Rules
1) Flat seed generation: produce rules from word-aligned sentence pairs, abstracted only to the POS level; no syntactic structure
2) Add compositional structure to the seed rule by exploiting previously learned rules
3) Seeded version-space learning: group seed rules by constituent sequences and alignments; the seed rules form the S-boundary of the version space; generalize with validation

Seeded Version Space Learning
Group seed rules into version spaces, then merge pairs of rules by:
1) Deletion of a constraint
2) Raising two value constraints to one agreement constraint, e.g.,
((x1 num) = *pl), ((x3 num) = *pl) → ((x1 num) = (x3 num))
3) Using the merged rule to translate (e.g., NP v det n)
Notes:
1) Partial order of rules in the version space
2) Generalization via merging
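A minimal sketch of the two generalization operators used when merging rules. The constraint encoding (dicts keyed by variable and feature) is a made-up simplification for illustration, not the AVENUE representation.

```python
# Sketch of rule merging in a seeded version space: (1) drop constraints not
# shared by both rules, (2) raise two equal-valued constraints on the same
# feature to a single agreement constraint, as in
# ((x1 num) = *pl), ((x3 num) = *pl)  ->  ((x1 num) = (x3 num)).

def merge_rules(c1, c2):
    """Each rule: {(var, feature): value}. Returns (value_constraints, agreements)."""
    # Operator 1 (deletion): keep only constraints shared by both rules.
    kept = {k: v for k, v in c1.items() if c2.get(k) == v}
    # Operator 2 (raising): pair equal-valued constraints on the same feature
    # but different variables into one agreement constraint.
    agreements, consumed = [], set()
    keys = sorted(kept)
    for i, k1 in enumerate(keys):
        for k2 in keys[i + 1:]:
            if k1[1] == k2[1] and kept[k1] == kept[k2]:
                agreements.append((k1, k2))
                consumed.update((k1, k2))
    values = {k: v for k, v in kept.items() if k not in consumed}
    return values, agreements

r1 = {("x1", "num"): "*pl", ("x3", "num"): "*pl", ("x2", "case"): "*nom"}
r2 = {("x1", "num"): "*pl", ("x3", "num"): "*pl"}
print(merge_rules(r1, r2))
# -> ({}, [(('x1', 'num'), ('x3', 'num'))]), i.e. ((x1 num) = (x3 num))
```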
rules in VS ((x1 num) = *pl), ((x3 num) = *pl) 2) Generalization ((x1Insitute num) = (x3 num) Languge Technologies via merging 3) Use merged rule to translate NP v det n Version Spaces (Mitchell, 1980) Anything G boundary b “Target” concept S boundary Specific Instances Carnegie Mellon Slide 33 Languge Technologies Insitute N Original & Seeded Version Spaces Version-spaces (Mitchell, 1980) – Symbolic multivariate learning – S & G sets define lattice boundaries – Exponential worst-case: O(bN) Seeded Version Spaces (Carbonell, 2002) – Generality level hypothesis seed – S & G subsets effective lattice – Polynomial worst case: O(bk/2), k=3,4 Carnegie Mellon Slide 34 Languge Technologies Insitute Seeded Version Spaces (Carbonell, 2002) Xn Ym G boundary Det Adj N Det N Adj (Y2 num) = (Y3 num) (Y2 gen) = (Y3 gen) (X3 num) = (Y2 num) “Target” concept S boundary “The big book” “ el libro grande” Carnegie Mellon Slide 35 Languge Technologies Insitute N Seeded Version Spaces Xn Ym G boundary k Seed (best guess) “Target” concept S boundary “The big book” “ el libro grande” Carnegie Mellon Slide 36 Languge Technologies Insitute N Transfer Rule Learning One SVS per transfer rule – Alignment + generalization level = seed – Apply SVS(seed,k) learning method » High confidence seed small k [k too small increases prob(failure)] » Low confidence seed larger k [k too large increases computation & elicitation] For translation use SVS f-centroid – If success % > threshold, halt SVS – If no rule found, k := K+1 & redo (to max K) Carnegie Mellon Slide 37 Languge Technologies Insitute Sample learned Rule Carnegie Mellon Slide 38 Languge Technologies Insitute Compositionality ;;L2: h &niin h rxb b h bxirwt Training example ;;L1: the widespread interest in the election NP::NP Rule type [Det N Det Adj PP] -> [Det Adj N PP] Component sequences ((X1::Y1)(X2::Y3) (X3::Y1)(X4::Y2) Component alignments (X5::Y4) ((Y3 num) = (X2 num)) Agreement constraints ((X2 num) = sg) ((X2 gen) = m) ((Y1 def) = +) ) Value constraints Automated Speech Recognition at CMU: Sphinx and related technologies Sphinx is the name for a family of speech recognition engines – developed by Reddy, Lee & Rudnicky – First continuous-speech, speaker-independent large vocabulary system – Standard continuous HMM technology – Multiple languages – Open source code base – Miniaturized and integrated with MT, dialog, and TTS – Basis for Native Accent & Carnegies Speech prouducts Janus system developed–by Waibel and Schultz – Started as Neural-Net Approach – Now standard continuous HMM system – Multiple languages – NOT open source code base – Held WER advantage over Sphinx till recently (now both are equal) Carnegie Mellon Slide 40 Languge Technologies Insitute Sphinx ASR at Carnegie Mellon Sphinx systems have evolved along two parallel batch Live ASR Technology innovation occurs in the Research branch, then transitions to the Live branch The Live branch focuses on the additional requirements of interactive real-time recognition and on the integration with higher-level components such as spoken language understanding and dialog management Research ASR Sphinx 1986 Sphinx-2 1992 SphinxL Sphinx-2L Sphinx-3.x Sphinx-3.7 PocketSphin x Carnegie Mellon Slide 41 Languge Technologies Insitute 1999 2006 Some features of Sphinx Level Features Acoustic Modeling CHMM-based representations Feature-space reduction (LDA/MLLT) Adaptive training (based on VTLN) Speaker adaptation (MLLR) Discriminative Training Real-Time Decoding GMM selection (4 level) Lexical trees; fixed-sized 
Spoken Language Processing
[Diagram: the speech-processing loop. Recognition turns speech into text; understanding extracts meaning via discourse and dialogue processing; task & domain knowledge drives action; generation and synthesis turn the response back into speech.]

A Generic Speech System
[Diagram: human speech → signal processing → decoder → parser → post-parser → dialog manager, which consults domain agents and drives a language generator and speech synthesizer (plus display and effector outputs) to produce machine speech.]

Decoding Speech
Signal processing:
– Reduce the dimensionality of the signal
– Noise conditioning
Decoder:
– Transcribe speech to words
– Uses acoustic models, lexical models, and language models: corpus-based statistical models

A Model for Speech Recognition
Speech recognition as an estimation problem:
– A = sequence of acoustic observations
– W = string of words
\hat{W} = \arg\max_W P(W | A)

Recognition
A Bayesian formulation of the recognition problem, the "fundamental equation" of speech recognition:
p(W | A) = p(W) \, p(A | W) / p(A)
posterior (event given evidence) = prior × conditional ("the model") / evidence
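A toy rendering of the fundamental equation over a two-item hypothesis list; the log-probabilities are invented numbers standing in for a language-model prior and an acoustic-model likelihood scored on the same observation A. (p(A) is constant across hypotheses, so it drops out of the argmax.)

```python
# W* = argmax_W p(W) p(A|W), with hypothetical scores for illustration.

hypotheses = {
    # W                  : (log p(W), log p(A|W))
    "recognize speech":    (-4.2, -180.0),
    "wreck a nice beach":  (-9.1, -178.5),
}

def decode(hyps, lm_weight=1.0):
    # In practice the LM term is scaled to balance the two dynamic ranges.
    return max(hyps, key=lambda w: lm_weight * hyps[w][0] + hyps[w][1])

print(decode(hypotheses))  # "recognize speech": -184.2 beats -187.6
```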
Understanding Speech
[Diagram: a parser (with grammar), post-parser, and context modules extract semantic content from the utterance and introduce context and world knowledge into the interpretation; domain agents provide grounding. Supporting work: ontology design, language acquisition, knowledge engineering.]

Interacting with the User
[Diagram: the dialog manager, driven by task schemas and context, guides the interaction through the task, maps user inputs and system state into actions, interacts with back-ends (database, live data such as the Web), and interprets information using domain knowledge supplied by domain agents and domain experts. Supporting work: task analysis, knowledge engineering.]

Communicating with the User
– Language generator (rule- and database-driven): decides what to say to the user, and how to phrase it
– Speech synthesizer: constructs sounds and intonation
– Display generator and action generator handle the non-speech outputs

Environmental Robustness for the STS
Expected types of disturbance:
– White/broadband noise
– Speech babble
– Single competing speakers
– Other background music
Comments:
– State-of-the-art technologies have been applied mostly to applications where time and computation are unlimited
– Samsung expects reverberation will not be a factor; if reverberation actually is in the environment, compensation becomes more difficult

Speech Recognition as Pattern Classification
Speech → feature extraction → features → decision-making procedure → word hypotheses
Major functional components:
– Signal processing to extract features from speech waveforms
– Comparison of features to pre-stored representations
Important design choices:
– Choice of features
– Specific method of comparing features to stored representations

Multistage Environmental Compensation for the Samsung STS
Compensation is a combination of:
– ZCAE, which separates speech from noise based on direction of arrival
– ANC, which cancels noise
– CDCN, VTS, and DCN, which provide static and dynamic noise removal, channel compensation, and feature normalization
– CBMF, which restores missing features lost to transient noise

Binaural Processing for Selective Reconstruction (the ZCAE Algorithm)
– Short-time Fourier analysis separates speech in time and frequency
– Separate time-frequency components according to direction of arrival, based on time-delay (ITD) comparisons
– Reconstruct the signal from only the components believed to arise from the desired source
– Missing-feature restoration can smooth the reconstruction

Performance of Zero-Crossing-Based Amplitude Estimation (ZCAE)
[Audio demos: original vs. separated signals with a 45-degree arrival-angle difference, with no reverberation and with a 300-ms reverberation time.]
ZCAE provides excellent separation in "dry" rooms but fails in reverberation.

Adaptive Noise Cancellation (ANC)
– Speech + noise enters the primary channel; correlated noise enters the reference channel
– An adaptive filter attempts to convert the noise in the reference channel to best resemble the noise in the primary channel, then subtracts it
– Performance degrades when speech leaks into the reference channel, and in reverberation

Possible Placement of Mics for ZCAE and ANC
One possible arrangement:
– Two omni mics for ZCAE processing
– One uni mic (notch toward the speaker) for the ANC reference
Another option is to place the reference mic on the back, away from the speaker.

"Classic" Compensation for Noise and Filtering
Model of signal degradation: speech passes through a linear filter and is corrupted by additive noise.
– Compensation procedures attempt to estimate the characteristics of the filter and the additive noise, then apply inverse operations
– Algorithms include CDCN, VTS, DCN
– Compensation is applied at the cepstral level (not to waveforms)
– Supersedes Wiener filtering
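For a concrete feel of the degradation model, here is a classic spectral-subtraction baseline: estimate the noise magnitude spectrum from leading frames assumed to be non-speech, then subtract it. This is the kind of processing the slide says CDCN/VTS/DCN supersede (they work on cepstra instead); every parameter below is an illustrative choice.

```python
import numpy as np

# Generic spectral-subtraction sketch of y[t] = x[t]*h + n[t] compensation.
def spectral_subtract(noisy, frame=512, noise_frames=10, floor=0.05):
    hop = frame // 2
    frames = [noisy[i:i + frame] for i in range(0, len(noisy) - frame + 1, hop)]
    window = np.hanning(frame)
    spectra = [np.fft.rfft(f * window) for f in frames]
    # Noise magnitude estimated from leading frames, assumed non-speech.
    noise_mag = np.mean([np.abs(s) for s in spectra[:noise_frames]], axis=0)
    out = np.zeros(len(frames) * hop + hop)
    for i, s in enumerate(spectra):
        mag = np.maximum(np.abs(s) - noise_mag, floor * np.abs(s))  # spectral floor
        clean = mag * np.exp(1j * np.angle(s))                      # keep noisy phase
        out[i*hop : i*hop + frame] += np.fft.irfft(clean, frame)    # overlap-add
    return out

rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)
noisy = np.concatenate([np.zeros(4096), tone]) + 0.3 * rng.standard_normal(20096)
enhanced = spectral_subtract(noisy)
```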
Compensation Results: WSJ Task in White Noise
[Chart: recognition accuracy vs. SNR; VTS provides a ~7-dB improvement over the baseline.]

What Does Compensated Speech "Sound" Like?
[Audio demos: speech reconstructed after CDCN processing at 0 dB and 20 dB SNR, plus the clean, original, and "recovered" signals.]
Signals that sound "better" do not necessarily produce lower error rates.

Missing-Feature Recognition
General approach:
– Determine which cells of a spectrogram-like display are unreliable (or "missing")
– Ignore the missing features, or make a best guess about their values based on the data that are present
Comment:
– Used in the STS for transient compensation and as a supporting technology for ZCAE

[Figure: original speech spectrogram.]
[Figure: the same spectrogram corrupted by noise at an SNR of 15 dB; some regions are affected far more than others.]
[Figure: ignoring the regions of the spectrogram corrupted by noise; all regions with SNR less than 0 dB are deemed missing (dark blue), and recognition is performed based on the colored regions alone.]
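A minimal sketch of the masking step just described: flag cells whose local SNR falls below 0 dB as missing, then score frames using only the reliable cells. Shapes, values, and the trivial frame scorer are all illustrative.

```python
import numpy as np

# Sketch of missing-feature masking and marginalization over reliable cells.

def reliability_mask(obs_db, noise_db, snr_floor_db=0.0):
    """True where the cell is reliable (local SNR above the floor)."""
    return (obs_db - noise_db) > snr_floor_db

def marginal_score(obs, mask, frame_scorer):
    """Score each frame using only its reliable cells."""
    return sum(frame_scorer(obs[t][mask[t]]) for t in range(len(obs)))

# Toy 3-frame, 4-channel log-spectrogram with a flat noise estimate.
obs = np.array([[10., 2., 8., 1.], [9., 1., 7., 2.], [11., 3., 9., 1.]])
noise = np.full_like(obs, 4.0)
mask = reliability_mask(obs, noise)
print(mask)                                   # reliable-cell pattern
print(marginal_score(obs, mask, np.mean))     # 27.0 on this toy data
```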
Robustness Summary
The CMU Robust Group will implement an ensemble of complementary technologies for the Samsung STS:
– ZCAE for speech separation by arrival angle
– Adaptive noise cancellation to reduce background interference
– CDCN, VTS, DCN to reduce stationary residual noise and provide channel compensation
– Cluster-based missing-feature compensation to protect against transient interference and to improve ZCAE performance
Most compensation procedures are based on a tight integration of environmental compensation with the core speech recognition engine.

Virtual English Village (VEV) [SAMSUNG SLIDE]
Service overview:
– An online 3D virtual world providing language students with an opportunity to practice and develop conversational skills for pre-defined domains/scenarios
– Provides students with a fun, safe environment to develop conversational skills
– Improves students':
» Grammar, by correcting student input or suggesting other appropriate variations
» Vocabulary, by suggesting other words
» Pronunciation, through clear synthesized speech and correction of student input
– The service requires the student to have access to a microphone
– A complement to existing English-language services in Korea
Target market:
– Initially targeting Korean elementary to high-school students (~8 to 16 years old) with some familiarity with English

Potential Carnegie Contributions to VEV
– ASR: Sphinx speech recognition (CMU)
– TTS: Festival/Festvox text-to-speech (CMU)
– Let's Go & Olympus dialog systems (CMU)
– Native Accent pronunciation tutor (Carnegie Speech)
– Grammar & lexicon tutors (Carnegie Speech)
» Story Friends, Story Librarian, . . .
» Intelligent tutoring architecture
– Language parsing/generation (CMU & CS)

Let's Go Performance
– Automated bus schedules/routes for the Port Authority of Pittsburgh
– Has answered the phone every evening and on weekends since March 5, 2005
– Presently over 52,000 calls from real users
» Variety of noise and transmission conditions
» Variety of dialects and other variants
– Estimated task success: about 75%
– Average number of turns: 12–14
Dialogue characteristics:
– Mixed initiative after the first turn
– Open-ended first turn ("What can I do for you?")
– Change of topic upon a comprehension problem (e.g., "What neighborhood are you in?")

Let's Go Architecture
– ASR: Sphinx-II
– Parsing: Phoenix
– Dialogue management: Olympus II
– Hub: Galaxy
– NLG: Rosetta
– Speech synthesis: Festival

Olympus II Dialogue Architecture
Improves timing awareness:
– Separates DM decision-making and awareness into separate modules
– Improves turn-taking
– Higher success after barge-in
– Reduces system cut-ins
The international spoken-dialogue community uses Olympus II for different systems:
– About 10 different applications
» Bus, air reservations, room reservations, movie line, personal minder, etc.
Open source, with documentation.

Olympus II Architecture
[Figure: Olympus II architecture diagram.]

Speech Synthesis
– Recognizer: Sphinx-II (enhanced, customized)
– Synthesizer: Festival
» Well-known, open source
» Provides very high-quality limited-domain synthesis
» Some people calling Let's Go think they are talking to a person
– Rosetta spoken-language generation: generates natural-language sentences using templates

Spoken Dialogue and Language Learning
Correction at any of several levels of language within a dialogue, while keeping the dialogue on track:
– User: "I want to go the airport"
– System: "Going to the airport, at what time?"
Also:
– Adaptation to non-native speech
– Uses F0 unit selection (Raux & Black, ASRU 2003) for emphasis in synthesis
– Lexical entrainment for rapid correction: an implicit strategy, with no break in the dialogue

Spoken Dialogue and Language Learning: Lexical Entrainment for Rapid Correction
– Collect speech in a WOZ (Wizard-of-Oz) setup
– Model the expected errors
– Find the closest user utterance:
ASR hypothesis → DP-based alignment against target prompts → prompt selection → emphasis → confirmation prompt
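The DP-based alignment box above can be sketched as word-level edit distance against the target prompts; the prompts and hypothesis below are invented examples, and the real system additionally models expected ASR errors.

```python
# Sketch of DP-based alignment for prompt selection: pick the target prompt
# closest to the ASR hypothesis by word-level edit distance.

def edit_distance(a, b):
    """Classic dynamic-programming word edit distance."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i-1][j] + 1,                       # deletion
                          d[i][j-1] + 1,                       # insertion
                          d[i-1][j-1] + (a[i-1] != b[j-1]))    # substitution
    return d[len(a)][len(b)]

def select_prompt(asr_hypothesis, target_prompts):
    hyp = asr_hypothesis.lower().split()
    return min(target_prompts, key=lambda p: edit_distance(hyp, p.lower().split()))

prompts = ["I want to go to the airport", "I want to go to the museum"]
print(select_prompt("I want to go the airport", prompts))
# -> "I want to go to the airport" (one insertion away)
```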