Machine Translation &
Automated Speech Recognition
Jaime Carbonell
With Richard Stern and Alex Rudnicky
Language Technologies Institute
Carnegie Mellon University
Telephone: +1 412 268 7279
Fax: +1 412 268 6298
jgc@cs.cmu.edu
For Samsung, February 20, 2008
Outline of Presentation
 Machine Translation (MT):
– Three major paradigms for MT (transfer, example-based, statistical)
– New method: Context-Based MT
 Automated Speech Recognition (ASR):
– The Sphinx family of recognizers
 Environmental Robustness for ASR
 Intelligent Tutoring for English
– Native Accent: Pronunciation product
– Grammar, Vocabulary & other tutoring
 Components for Virtual English Village
Language Technologies
BILL OF RIGHTS → TECHNOLOGY (e.g.)
 “…right information” → IR (search engines)
 “…right people” → Routing, personalization
 “…right time” → Anticipatory analysis
 “…right medium” → Speech: ASR, TTS
 “…right language” → Machine translation
 “…right level of detail” → Summarization, Drill-down
Machine Translation:
Context is Needed to Resolve Ambiguity
Example: English → Japanese
Power line – densen (電線)
Subway line – chikatetsu (地下鉄)
(Be) on line – onrain (オンライン)
(Be) on the line – denwachuu (電話中)
Line up – narabu (並ぶ)
Line one’s pockets – kanemochi ni naru (金持ちになる)
Line one’s jacket – uwagi o nijuu ni suru (上着を二重にする)
Actor’s line – serifu (セリフ)
Get a line on – joho o eru (情報を得る)
Sometimes local context suffices (as above)
. . . but sometimes not
CONTEXT: More is Better
 Examples requiring longer-range context:
– “The line for the new play extended for 3 blocks.”
– “The line for the new play was changed by the scriptwriter.”
– “The line for the new play got tangled with the other props.”
– “The line for the new play better protected the quarterback.”
 Better MT requires better context sensitivity
– Syntax-rule based MT → little or no context
– EBMT → potentially long context on exact substring matches
– SMT → short context with proper probabilistic optimization
– CBMT → long context with optimization (but still experimental)
Types of Machine Translation

[Vauquois pyramid diagram: the Source (Arabic) is analyzed via Syntactic Parsing, then Semantic Analysis, up to an Interlingua; generation proceeds through Sentence Planning and Text Generation down to the Target (English). Transfer Rules bridge the intermediate levels; Direct approaches (SMT, EBMT) map source to target at the lowest level.]
Example-Based Machine Translation Example

English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.

English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.

English: I would like to meet the tallest man.
Mapudungun (new): Ayükefun trawüael chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
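The recombination step above can be sketched in a few lines: translate a new sentence by splicing together target fragments of stored examples. The fragment table below is hand-written for this one example (real EBMT induces fragments via sub-sentential alignment), and the output reproduces the uncorrected “new” translation shown above:

```python
# Toy sketch of EBMT recombination: greedily match the longest known source
# fragment and emit its stored target fragment. Hand-built fragment table.
examples = {
    "i would like to meet": "Ayükefun trawüael",
    "the tallest man": "chi doy fütra chi wentru",
}

def translate(sentence):
    out, words, i = [], sentence.lower().split(), 0
    while i < len(words):
        for j in range(len(words), i, -1):  # longest fragment starting at i
            frag = " ".join(words[i:j])
            if frag in examples:
                out.append(examples[frag])
                i = j
                break
        else:
            out.append(words[i])  # pass through unmatched words
            i += 1
    return " ".join(out)

print(translate("I would like to meet the tallest man"))
# -> "Ayükefun trawüael chi doy fütra chi wentru" (cf. the corrected form above)
```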
Statistical Machine Translation:
The Three Ingredients

 English Language Model: p(E)
 English-Korean Translation Model: p(K|E)
 Decoder: E* = argmax_E p(E|K) = argmax_E p(E) · p(K|E)
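The decoder’s job in this noisy-channel factorization is just the argmax. A toy sketch with hand-set probabilities (real decoders search an enormous hypothesis space incrementally rather than enumerating candidates):

```python
# Noisy-channel decoding sketch: E* = argmax_E p(E) * p(K|E).
# The probabilities are illustrative stand-ins, not real model outputs.
def decode(k_sentence, candidates, lm, tm):
    """Pick the English candidate maximizing p(E) * p(K|E)."""
    return max(candidates, key=lambda e: lm[e] * tm[(k_sentence, e)])

lm = {"the line is busy": 1e-5, "the queue is busy": 1e-7}           # p(E)
tm = {("K", "the line is busy"): 0.02, ("K", "the queue is busy"): 0.04}  # p(K|E)
print(decode("K", list(lm), lm, tm))  # "the line is busy": the LM prefers it
```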
Alignments

[Example alignment: “The proposal will not now be implemented” ↔ “Les propositions ne seront pas mises en application maintenant”]

Translation models are built in terms of hidden alignments:

p(F | E) = Σ_A p(F, A | E)

Each English word “generates” zero or more French words.
Conditioning Joint Probabilities

p(F, A | E) = p(m = length(F) | E) · ∏_{j=1..m} p(a_j | a_1^{j-1}, f_1^{j-1}, m, E) · p(f_j | a_1^{j}, f_1^{j-1}, m, E)

The modeling problem is how to approximate these terms to achieve:
– Statistically robust estimates
– Computationally efficient training
– An effective and accurate model
The Gang of Five IBM++ Models

A series of five models is sequentially trained to “bootstrap” a detailed translation model from parallel corpora:
1. Word-by-word translation. Parameters p(f|e) between pairs of words.
2. Local alignments. Adds alignment probabilities p(a_j | j, sentence lengths).
3. Fertilities and distortions. Allow a source word to generate zero or more target words, then place words in position with p(φ(e) | e) and p(j | i, sentence lengths).
4. Tablets. Groups words into phrases.
• Primary idea of Example-Based MT incorporated into SMT
• Currently seeking meaningful (vs. arbitrary) phrases, e.g., NPs, PPs
5. Non-deficient alignments. Don’t ask. . .
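Model 1’s word-by-word parameters p(f|e) are classically estimated with EM over a parallel corpus. A minimal sketch on a toy corpus (no NULL word, no smoothing, nothing like production scale):

```python
# Minimal sketch of EM training for IBM Model 1 translation probabilities
# t(f|e), on a toy English-German corpus. Illustrative only.
from collections import defaultdict

corpus = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]

t = defaultdict(lambda: 1.0)  # t[(f, e)], uniform (unnormalized) start

for _ in range(10):  # EM iterations
    count = defaultdict(float)  # expected counts c(f, e)
    total = defaultdict(float)  # marginals c(e)
    for e_sent, f_sent in corpus:
        for f in f_sent:
            z = sum(t[(f, e)] for e in e_sent)  # normalize over alignments
            for e in e_sent:
                p = t[(f, e)] / z
                count[(f, e)] += p
                total[e] += p
    for (f, e), c in count.items():  # M-step: renormalize per English word
        t[(f, e)] = c / total[e]

print(round(t[("haus", "house")], 3))  # approaches 1.0 as EM disambiguates
```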
Multi-Engine Machine Translation
 MT Systems have different strengths
– Rapidly adaptable: Statistical, example-based
– Good grammar: Rule-based (linguistic) MT
– High precision in narrow domains: KBMT
– Minority Language MT: Learnable from informant
 Combine results of parallel-invoked MT
– Select best of multiple translations
– Selection based on optimizing combination of:
» Target language joint-exponential model
» Confidence scores of individual MT engines
Illustration of Multi-Engine MT

Source: El punto de descarge se cumplirá en el puente Agua Fria

Three engines translate the same source:
 The drop-off point will comply with The cold Bridgewater
 The discharge point will self comply in the “Agua Fria” bridge
 Unload of the point will take place at the cold water of bridge
Key Challenges for Corpus-Based MT

 Long-Range Context
– More is better
– How to incorporate it?
– How to do so tractably?
– “I conjecture that MT is AI complete” – H. Simon
 State-of-the-Art (context)
– Trigrams → 4- or 5-grams in LM only (e.g., Google)
– SMT → SMT+EBMT (phrase tables, etc.)

 Parallel Corpora
– More is better
– Where to find enough of it?
– What about rare languages?
– “There’s no data like more data” – IBM SMT circa 1992
 State-of-the-Art (corpora)
– Avoids rare languages (GALE only does Arabic & Chinese)
– Symbolic rule-learning from minimalist parallel corpus (AVENUE project at CMU)
Parallel Text: Requiring Less is Better
(Requiring None is Best)
 Challenge
– There is just not enough to approach human-quality MT for most major
language pairs (we need ~100X more)
– Much parallel text is not on-point (not on domain)
– Rare languages and some distant pairs have virtually no parallel text
 Context-Based (CBMT) Approach
– Invented and patented by Meaningful Machines (in New York)
– Requires no parallel text, no transfer rules . . .
– Instead, CBMT needs
» A fully-inflected bilingual dictionary
» A (very large) target-language-only corpus
» A (modest) source-language-only corpus [optional, but preferred]
CBMT System

[System diagram. Components: Source Language input → N-gram Parser/Segmenter → Flooder (non-parallel text method), which draws on a Bilingual Dictionary, Target Corpora, [Source Corpora], and Gazetteers; an Edge Locker and a Cross-Language N-gram Database (TTR) handle substitution requests and stored/approved N-gram pairs; N-gram candidates feed the Overlap-based Decoder, which produces the Target Language output.]
Step 1: Source Sentence Chunking

 Segment source sentence into overlapping n-grams via sliding window
 Typical n-gram length 4 to 9 terms
 Each term is a word or a known phrase
 Any sentence length (for BLEU test: ave-27; shortest-8; longest-66 words)

[Diagram: a sentence S1 … S9 segmented into overlapping 5-grams: S1–S5, S2–S6, S3–S7, S4–S8, S5–S9]
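Step 1 is a plain sliding window; a minimal sketch with n fixed at 5 (the system varies n from 4 to 9, over words and known phrases):

```python
# Overlapping n-gram segmentation with a sliding window (Step 1 sketch),
# assuming whitespace tokenization.
def chunk(tokens, n=5):
    """Return all overlapping n-grams (whole sentence if shorter than n)."""
    if len(tokens) <= n:
        return [tokens]
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

sentence = "S1 S2 S3 S4 S5 S6 S7 S8 S9".split()
for ngram in chunk(sentence):
    print(" ".join(ngram))  # S1..S5, S2..S6, ..., S5..S9
```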
Step 2: Dictionary Lookup

 Using the bilingual dictionary, list all possible target translations for each source word or phrase

[Diagram: the source word-string S2 S3 S4 S5 S6 is passed through the inflected bilingual dictionary, yielding target word lists that form the Flooding Set: {T2-a … T2-d}, {T3-a … T3-c}, {T4-a … T4-e}, {T5-a}, {T6-a … T6-c}]
Step 3: Search Target Text

 Using the Flooding Set, search target text for word-strings containing one word from each group
 Find the maximum number of words from the Flooding Set in a minimum-length word-string
– Words or phrases can be in any order
– Ignore function words in the initial step (T5 is a function word in this example)
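The search amounts to finding windows of the target corpus that cover as many flood groups as possible in as few words as possible. A brute-force sketch over bounded windows (the actual Flooder is far more efficient and handles phrases; the T-notation matches the example slides that follow):

```python
# Flooding search sketch (Step 3): maximize flood-group coverage, then
# minimize window length. Brute force over windows up to max_window wide.
def flood_search(corpus_tokens, flood_groups, max_window=10):
    """Return the best (start, end) window over the target corpus."""
    best, best_key = None, (-1, 0)
    for i in range(len(corpus_tokens)):
        for j in range(i + 1, min(i + max_window, len(corpus_tokens)) + 1):
            window = set(corpus_tokens[i:j])
            covered = sum(1 for g in flood_groups if window & set(g))
            key = (covered, -(j - i))  # more groups, then shorter window
            if key > best_key:
                best, best_key = (i, j), key
    return best

corpus = "T(x) T2-d T(x) T4-a T6-b T(x) T3-a".split()
groups = [["T2-a", "T2-b", "T2-c", "T2-d"], ["T3-a", "T3-b", "T3-c"],
          ["T4-a", "T4-b", "T4-c", "T4-d", "T4-e"], ["T6-a", "T6-b", "T6-c"]]
print(flood_search(corpus, groups))  # (1, 7): covers all four groups
```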
Step 3: Search Target Text (Example)

[Diagram: the Flooding Set is matched against the target corpus; Target Candidate 1 is a word-string containing flood words T6-c, T3-b, and T2-d interspersed with extraneous words T(x)]
Step 3: Search Target Text (Example)

[Diagram: Target Candidate 2 is a word-string covering more flood groups (T4-a, T6-b, T2-c, T3-a)]
Step 3: Search Target Text (Example)

[Diagram: Target Candidate 3 is a compact word-string covering flood words T3-c, T2-b, T4-e, T5-a, T6-a]

Reintroduce function words after the initial match (e.g., T5)
Step 4: Score Word-String Candidates

 Scoring of candidates based on:
– Proximity (minimize extraneous words in target n-gram → precision)
– Number of word matches (maximize coverage → recall)
– Regular words given more weight than function words
– Combine results (e.g., optimize F1 or p-norm or …)

[Table: the three target word-string candidates ranked by proximity, word matches, and “regular”-word scoring; e.g., “T6-c … T3-b T(x) T2-d” ranks 3rd, “T4-a T6-b T(x) T2-c T3-a” 2nd, and “T3-c T2-b T4-e T5-a T6-a” 1st overall]
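A sketch of the combination described above, using F1 over proximity (precision) and coverage (recall); the 0.5 down-weighting of function words is an assumed value for illustration:

```python
# Step 4 scoring sketch: F1 over proximity (precision) and coverage (recall),
# with function words down-weighted. Weights are illustrative assumptions.
def score_candidate(window, flood_words, function_words=frozenset()):
    matches = [w for w in window if w in flood_words]
    weight = sum(0.5 if w in function_words else 1.0 for w in matches)
    precision = weight / max(len(window), 1)    # penalize extraneous words
    recall = weight / max(len(flood_words), 1)  # coverage of the flooding set
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)  # F1 combination

flood = {"T2-c", "T3-a", "T4-a", "T6-b"}
print(score_candidate(["T4-a", "T6-b", "T(x)", "T2-c", "T3-a"], flood))  # ~0.89
```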
Step 5: Select Candidates Using Overlap
(Propagate context over entire sentence)

[Diagram: candidate lists for three overlapping source word-strings; candidates from adjacent word-strings share words (e.g., “T4-a T6-b T(x3) T2-c T3-a” overlaps “T2-c T3-a T(x8) …”), so mutually consistent chains can be assembled]
Step 5: Select Candidates Using Overlap

Best translations selected via maximal overlap

[Diagram: two alternative assemblies of the candidate n-grams; in each, adjacent n-grams are chained through their shared words (e.g., “T4-a T6-b T(x3) T2-c T3-a” + “T2-c T3-a T(x8) T3-c” + “T3-c T2-b T4-e T5-a T6-a”) into a full-sentence translation]
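A minimal sketch of the chaining step: greedy left-to-right joining of candidates by maximal word overlap (the real decoder optimizes over all candidate combinations, more like a beam search over a lattice):

```python
# Overlap-based decoding sketch (Step 5): chain candidate n-grams whose
# suffix/prefix words overlap, propagating context across the sentence.
def overlaps(left, right):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0

def decode(candidate_lists):
    """Greedy left-to-right chaining by maximal overlap."""
    chain = list(candidate_lists[0][0])
    for candidates in candidate_lists[1:]:
        best = max(candidates, key=lambda c: overlaps(chain, c))
        k = overlaps(chain, best)
        chain.extend(best[k:])
    return chain

lists = [
    [["T4-a", "T6-b", "T(x3)", "T2-c", "T3-a"]],
    [["T2-c", "T3-a", "T(x8)", "T3-c"], ["T2-d", "T(x5)", "T(x6)"]],
    [["T3-c", "T2-b", "T4-e", "T5-a", "T6-a"]],
]
print(" ".join(decode(lists)))  # chains all three via their shared words
```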
A (Simple) Real Example of Overlap

Flooding → N-gram fidelity
Overlap → Long-range fidelity

N-grams generated from Flooding:
– a United States soldier
– United States soldier died
– soldier died and two others
– died and two others were injured
– two others were injured Monday

N-grams connected via Overlap:
– a United States soldier died and two others were injured Monday

Systran:
– A soldier of the wounded United States died and other two were east Monday
Comparative System Scores

[Bar chart of BLEU scores; all Spanish systems were scored on the same Spanish test set, and the human scoring range is roughly 0.7–0.85:
– Google Chinese (’06 NIST): 0.3859
– Google Arabic (’05 NIST): 0.5137
– Systran Spanish: 0.5551
– SDL Spanish: 0.5610
– Google Spanish (’08 top lang): 0.7209
– MM Spanish: 0.7447
– MM Spanish (Non-blind): 0.7533]
Historical CBMT Scoring

[Line chart of CBMT BLEU scores from Apr 2004 to Dec 2007, with separate blind and non-blind series: starting at .3743 (Apr 04), rising through the .56–.67 range during 2005–2006, and reaching .7447 (blind) and .7533 (non-blind) by late 2007, at the lower edge of the human scoring range (~0.7–0.85)]
Illustrative Translation Examples
Un coche bomba estalla junto a una comisaría de policía en Bagdad
MM:
A car bomb explodes next to a police station in Baghdad
Systran:
A car pump explodes next to a police station of police in bagdad
Hamas anunció este jueves el fin de su cese del fuego con Israel
MM:
Hamas announced Thursday the end of the cease fire with israel
Systran:
Hamas announced east Thursday the aim of its cease-fire with Israel
Testigos dijeron haber visto seis jeeps y dos vehículos blindados
MM:
Witnesses said they have seen six jeeps and two armored vehicles
Systran:
Six witnesses said to have seen jeeps and two armored vehicles
A More Complex Example
 Un soldado de Estados Unidos murió y otros dos resultaron heridos este lunes
por el estallido de un artefacto explosivo improvisado en el centro de Bagdad,
dijeron funcionarios militares estadounidenses.
 MM: A United States soldier died and two others were injured monday by the
explosion of an improvised explosive device in the heart of Baghdad, American
military officials said.
 Systran: A soldier of the wounded United States died and other two were east
Monday by the outbreak from an improvised explosive device in the center of
Bagdad, said American military civil employees.
Summing up CBMT
 What if a source word or phrase is not in the bilingual dictionary?
– Automatically find near synonyms in source
– Replace and retranslate
 Current Status
– Developing a Chinese-to-English CBMT system (no results yet)
– Pre-commercial (field-testing)
– CMU participating in improving CBMT technology
 Advantages
– Most accurate MT method in existence
– Does not require parallel text
 Disadvantages
– Run-time speed: 20 s/w (7/07) → 1.5 s/w (2/08) → 0.1 s/w (12/08)
– Requires heavy-duty servers (not yet miniaturizable)
Flat Seed Generation
The highly qualified applicant visits the company.
Der äußerst qualifizierte Bewerber besucht die Firma.
((1,1),(2,2),(3,3),(4,4),(5,5),(6,6))
S::S [det adv adj n v det n]
[det adv adj n v det n]
((x1::y1) (x2::y2)….
((x4 agr) = *3-sing) …
((y3 case) = *nom)…)
Compositionality

NP::NP [det adv adj n]
[det adv adj n]
((x1::y1)…
((y4 agr) = (x4 agr)
….)
← Adjust rule to reflect compositionality

S::S [NP v det n]
[NP n v det n]
((x1::y1) (x2::y2)….
…
((y1 case) = *nom)…)
← NP rule can be used to translate part of the sentence; keep/add context constraints, eliminate unnecessary ones
Goal: Syntactic Transfer Rules

1) Flat Seed Generation: produce rules from word-aligned sentence pairs, abstracted only to POS level; no syntactic structure
2) Compositionality: add compositional structure to the seed rule by exploiting previously learned rules
3) Seeded Version Space Learning: group seed rules by constituent sequences and alignments; seed rules form the s-boundary of the VS; generalize with validation
Seeded Version Space Learning

Group seed rules into version spaces, then merge rules:
1) Deletion of a constraint
2) Raising of two values to one agreement constraint, e.g., ((x1 num) = *pl), ((x3 num) = *pl) → ((x1 num) = (x3 num))
3) Use the merged rule to translate NP v det n

Notes:
1) Partial order of rules in VS
2) Generalization via merging
Version Spaces (Mitchell, 1980)

[Diagram: a lattice of hypotheses from “Anything” at the top (G boundary) down to specific instances at the bottom (S boundary), with the “target” concept in between; branching factor b, depth N]
Original & Seeded Version Spaces

 Version spaces (Mitchell, 1980)
– Symbolic multivariate learning
– S & G sets define lattice boundaries
– Exponential worst case: O(b^N)
 Seeded Version Spaces (Carbonell, 2002)
– Generality level → hypothesis seed
– S & G subsets → effective lattice
– Polynomial worst case: O(b^(k/2)), k = 3, 4
Seeded Version Spaces (Carbonell, 2002)

[Diagram: the lattice between the G boundary (most general rule, Xn → Ym) and the S boundary (the specific instance “The big book” → “el libro grande”); the “target” concept in between is Det Adj N → Det N Adj with constraints (Y2 num) = (Y3 num), (Y2 gen) = (Y3 gen), (X3 num) = (Y2 num)]
Seeded Version Spaces

[Diagram: the same lattice, with the seed (best guess) placed at an intermediate generality level; only a band of k levels around the seed is searched between the G boundary (Xn → Ym) and the S boundary (“The big book” → “el libro grande”)]
Transfer Rule Learning

 One SVS per transfer rule
– Alignment + generalization level = seed
– Apply SVS(seed, k) learning method
» High-confidence seed → small k [k too small increases prob(failure)]
» Low-confidence seed → larger k [k too large increases computation & elicitation]
 For translation use SVS f-centroid
– If success % > threshold, halt SVS
– If no rule found, k := k+1 & redo (to max K)
Sample Learned Rule

[Figure: a sample learned transfer rule; an annotated example follows]
Compositionality

;;L2: h &niin h rxb b h bxirwt                 (training example, source)
;;L1: the widespread interest in the election  (training example, English)
NP::NP                                          (rule type)
[Det N Det Adj PP] -> [Det Adj N PP]            (component sequences)
((X1::Y1) (X2::Y3)
 (X3::Y1) (X4::Y2)
 (X5::Y4)                                       (component alignments)
 ((Y3 num) = (X2 num))                          (agreement constraint)
 ((X2 num) = sg)
 ((X2 gen) = m)
 ((Y1 def) = +) )                               (value constraints)
Automated Speech Recognition at CMU:
Sphinx and related technologies

 Sphinx is the name for a family of speech recognition engines, developed by Reddy, Lee & Rudnicky
– First continuous-speech, speaker-independent, large-vocabulary system
– Standard continuous HMM technology
– Multiple languages
– Open source code base
– Miniaturized and integrated with MT, dialog, and TTS
– Basis for Native Accent & Carnegie Speech products
 Janus system, developed by Waibel and Schultz
– Started as a neural-net approach
– Now a standard continuous HMM system
– Multiple languages
– NOT open source code base
– Held a WER advantage over Sphinx until recently (now both are equal)
Sphinx ASR at Carnegie Mellon

 Sphinx systems have evolved along two parallel branches: Research (batch) ASR and Live ASR
 Technology innovation occurs in the Research branch, then transitions to the Live branch
 The Live branch focuses on the additional requirements of interactive real-time recognition and on integration with higher-level components such as spoken language understanding and dialog management

[Timeline, 1986–2006: Research branch: Sphinx (1986) → Sphinx-2 (1992) → Sphinx-3.x → Sphinx-3.7; Live branch: SphinxL → Sphinx-2L → PocketSphinx]
Some Features of Sphinx

Acoustic Modeling:
– CHMM-based representations
– Feature-space reduction (LDA/MLLT)
– Adaptive training (based on VTLN)
– Speaker adaptation (MLLR)
– Discriminative training

Real-Time Decoding:
– GMM selection (4-level)
– Lexical trees; fixed-size tree copying
– Parameter quantization

Live Interaction Support:
– End-point detection
– Dynamic class-based language models
– Dynamic language models
– Partial hypotheses
– Confidence annotation
– Dynamic noise processing
Spoken Language Processing

[Diagram: speech → recognition → text → understanding → meaning → dialogue/discourse (drawing on task & domain knowledge) → generation → synthesis → action]
A Generic Speech System

[Diagram: human speech → Signal Processing → Decoder → Parser → post-Parser → Dialog Manager, which consults Domain Agents and drives a Language Generator → Speech Synthesizer → machine speech, plus display and effector outputs]
Decoding Speech

[Diagram: Signal processing (reduce dimensionality of the signal; noise conditioning) feeds the Decoder (transcribe speech to words), which draws on Lexical models, Acoustic models, and Language models, all corpus-based statistical models]
A Model for Speech Recognition

 Speech recognition as an estimation problem
– A = sequence of acoustic observations
– W = string of words

Ŵ = argmax_W P(W | A)
Recognition

 A Bayesian formulation of the recognition problem
– the “fundamental equation” of speech recognition

p(W | A) = p(W) · p(A | W) / p(A)

(posterior = prior × conditional / evidence; W is the event, A the evidence; the prior p(W) and the conditional p(A|W) together constitute “the model”)
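A toy numeric reading of the equation, with made-up probabilities for two competing word strings (p(A) is the same for both hypotheses, so it only normalizes):

```python
# Toy check of the fundamental equation p(W|A) = p(W) p(A|W) / p(A),
# with invented numbers for two competing word strings.
priors = {"recognize speech": 1e-4, "wreck a nice beach": 1e-6}       # p(W)
likelihoods = {"recognize speech": 2e-3, "wreck a nice beach": 5e-3}  # p(A|W)

p_A = sum(priors[w] * likelihoods[w] for w in priors)  # evidence p(A)
posteriors = {w: priors[w] * likelihoods[w] / p_A for w in priors}
print(max(posteriors, key=posteriors.get))  # the LM prior decides it
```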
Understanding Speech

[Diagram: the Parser (backed by a Grammar: ontology design, language acquisition) extracts semantic content from the utterance; the Post-parser (backed by Context) introduces context and world knowledge into the interpretation; Domain Agents provide grounding and knowledge engineering]
Interacting with the User

[Diagram: the Dialog manager (backed by Task schemas and Context) guides the interaction through the task, maps user inputs and system state into actions, and interacts with back-ends; Domain agents interpret information using domain knowledge, drawing on databases and live data (e.g., Web); task analysis and knowledge engineering come from a domain expert]
Communicating with the User

[Diagram: the Language Generator (rules, database) decides what to say to the user and how to phrase it; the Speech Synthesizer constructs sounds and intonation; Display and Action Generators (rules, database) produce the non-speech outputs]
Environmental Robustness for the STS

 Expected types of disturbance:
– White/broadband noise
– Speech babble
– Single competing speakers
– Other background music
 Comments:
– State-of-the-art technologies have been applied mostly to applications where time and computation are unlimited
– Samsung expects reverberation will not be a factor; if reverb actually is in the environment, compensation becomes more difficult
Speech Recognition as Pattern Classification

[Diagram: speech → Feature extraction → speech features → Decision-making procedure → word hypotheses]

 Major functional components:
– Signal processing to extract features from speech waveforms
– Comparison of features to pre-stored representations
 Important design choices:
– Choice of features
– Specific method of comparing features to stored representations
Multistage environmental compensation
for the Samsung STS
 Compensation is a combination of:
– ZCAE separates speech from noise based on direction of arrival
– ANC cancels noise
– CDCN, VTS, DCN provide static and dynamic noise removal, channel
compensation, feature normalization
– CBMF restores missing features that are lost due to transient noise
Binaural processing for selective
reconstruction (the ZCAE algorithm)
 Short-time Fourier analysis separates speech in time and frequency
 Separate time-frequency components according to direction of arrival
based on time delay comparisons (ITD)
 Reconstruct signal from only the components that are believed to
arise from desired source
 Missing feature restoration can smooth reconstruction
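The core idea can be sketched as time-frequency masking driven by per-bin time-delay estimates. The snippet below is a simplified stand-in, not the actual ZCAE algorithm (which estimates amplitudes from zero crossings); phase wrapping at high frequencies and overlap-add scaling are glossed over:

```python
# Sketch of direction-of-arrival masking on a two-microphone STFT pair:
# keep time-frequency cells whose estimated interaural time delay (ITD)
# indicates the desired source, zero the rest, and resynthesize.
import numpy as np

def separate(left, right, fs, n_fft=512, hop=256, max_itd=50e-6):
    """Keep T-F cells with |ITD| < max_itd (source near broadside)."""
    win = np.hanning(n_fft)
    out = np.zeros(len(left))
    freqs = np.fft.rfftfreq(n_fft, 1 / fs)
    freqs[0] = 1.0  # avoid division by zero at DC
    for i in range(0, len(left) - n_fft, hop):
        L = np.fft.rfft(win * left[i:i + n_fft])
        R = np.fft.rfft(win * right[i:i + n_fft])
        itd = np.angle(L * np.conj(R)) / (2 * np.pi * freqs)  # per-bin delay
        mask = np.abs(itd) < max_itd
        out[i:i + n_fft] += np.fft.irfft(L * mask) * win      # overlap-add
    return out
```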
Performance of zero-crossing-based amplitude estimation (ZCAE)

 Sample results with 45-degree arrival angle difference:
– With no reverberation: [audio: original vs. separated]
– With 300-ms reverb time: [audio: original vs. separated]
 ZCAE provides excellent separation in “dry” rooms but fails in reverberation
Adaptive noise cancellation (ANC)
 Speech + noise enters primary channel, correlated noise enters
reference channel
 Adaptive filter attempts to convert the noise in the reference (secondary) channel to best resemble the noise in the primary channel, and subtracts it
 Performance degrades when speech leaks into the reference channel and in reverberation
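The adaptive filter in this arrangement is classically an LMS filter; a minimal sketch under illustrative parameters (a real system must freeze or slow adaptation when speech leaks into the reference channel):

```python
# LMS adaptive noise cancellation sketch: the filter adapts so the filtered
# reference-channel noise matches the primary-channel noise; the residual
# is the speech estimate. Order and step size are illustrative.
import numpy as np

def lms_anc(primary, reference, order=32, mu=0.01):
    w = np.zeros(order)           # adaptive filter taps
    out = np.zeros(len(primary))  # error signal = enhanced speech
    for n in range(order, len(primary)):
        x = reference[n - order:n][::-1]  # most recent reference samples
        y = w @ x                         # filtered noise estimate
        e = primary[n] - y                # subtract estimated noise
        w += mu * e * x                   # LMS tap update
        out[n] = e
    return out
```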
Possible placement of mics for ZCAE and ANC
 One possible arrangement:
– Two omni mics for ZCAE processing
– One uni mic (notch toward speaker) for ANC reference
 Another option is to have reference mic on back, away from
speaker
“Classic” compensation for noise and filtering
 Model of signal degradation: [diagram: clean speech passes through a linear channel filter, then additive noise is introduced]
 Compensation procedures attempt to estimate characteristics of the filter and additive noise, and then apply inverse operations
 Algorithms include CDCN, VTS, DCN
 Compensation applied at the cepstral level (not to waveforms)
 Supersedes Wiener filtering
Compensation results: WSJ task in white noise

[Chart: recognition performance vs. SNR on the WSJ task in white noise]
 VTS provides ~7-dB improvement over baseline
What does compensated speech “sound” like?

 Speech reconstructed after CDCN processing: [audio: original vs. “recovered” at 0 dB, 20 dB, and clean]
 Signals that sound “better” do not produce lower error rates
Missing-feature recognition
 General approach:
– Determine which cells of a spectrogram-like display are unreliable (or
“missing”)
– Ignore missing features or make best guess about their values based on
data that are present
 Comment:
– Used in the STS for transient compensation and as a supporting
technology for ZCAE
Original speech spectrogram

[Figure: spectrogram of clean speech]
Spectrogram corrupted by noise at 15-dB SNR

[Figure: the same spectrogram corrupted by noise]
 Some regions are affected far more than others
Ignoring regions in the spectrogram that are corrupted by noise

[Figure: spectrogram with unreliable cells masked out]
 All regions with SNR less than 0 dB deemed missing (dark blue)
 Recognition performed based on the remaining (colored) regions alone
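The masking criterion itself is simple; a small sketch, assuming a per-cell noise-power estimate is available (obtaining that estimate is the hard part and is not addressed here):

```python
# Missing-feature mask sketch: cells with local SNR below 0 dB are "missing".
# SNR is approximated here as noisy power over estimated noise power.
import numpy as np

def missing_feature_mask(noisy_power, noise_power_est):
    """Boolean mask: True where the cell is reliable (local SNR >= 0 dB)."""
    snr_db = 10 * np.log10(noisy_power / np.maximum(noise_power_est, 1e-12))
    return snr_db >= 0.0

noisy = np.array([[2.0, 0.5], [8.0, 0.9]])  # |STFT|^2 of noisy speech
noise = np.array([[1.0, 1.0], [1.0, 1.0]])  # noise power estimate
print(missing_feature_mask(noisy, noise))   # [[True False] [True False]]
```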
Robustness Summary
 The CMU Robust Group will implement an ensemble of
complementary technologies for the Samsung STS:
– ZCAE for speech separation by arrival angle
– Adaptive noise compensation to reduce background interference
– CDCN, VTS, DCN to reduce stationary residual noise and provide
channel compensation
– Cluster-based missing-feature compensation to protect against transient
interference and to improve ZCAE performance
 Most compensation procedures are based on a tight
integration of environmental compensation with the core
speech recognition engine.
Virtual English Village (VEV) [SAMSUNG SLIDE]

 Service Overview
– An online 3D virtual world providing language students with an opportunity to practice and develop conversational skills for pre-defined domains/scenarios
– Provide students with a fun, safe environment to develop conversational skills
– Improve students’:
» Grammar → by correcting student input or suggesting other appropriate variations
» Vocabulary → by suggesting other words
» Pronunciation → through clear, synthesized speech and correcting students’ input
– Service requires the student to have access to a microphone
– A complement to existing English language services in Korea
 Target Market
– Initially targeting Korean elementary to high-school students (~8 to 16 years old) with some familiarity with the English language
Potential Carnegie Contributions to VEV
 ASR: Sphinx speech recognition (CMU)
 TTS: Festival/Festbox text-to-speech (CMU)
 Let’s Go & Olympus dialog systems (CMU)
 Native Accent pronunciation tutor (Carnegie Speech)
 Grammar & lexicon tutors (Carnegie Speech)
– Story Friends, Story Librarian, …
– Intelligent tutoring architecture
 Language parsing/generation (CMU & CS)
Let’s Go Performance

 Automated bus schedule/routes for the Port Authority of Pittsburgh
 Has answered the phone every evening and on weekends since March 5, 2005
 Presently over 52,000 calls from real users
– Variety of noise and transmission conditions
– Variety of dialects and other variants
 Estimated task success: about 75%
 Average number of turns: 12–14
 Dialogue characteristics:
– Mixed initiative after first turn
– Open-ended first turn (“What can I do for you?”)
– Change of topic upon comprehension problem (e.g., “What neighborhood are you in?”)
Let’s Go Architecture

[Diagram: components connected via the Galaxy HUB: ASR (Sphinx II), Parsing (Phoenix), Dialogue Management (Olympus II), NLG (Rosetta), Speech Synthesis (Festival)]
Olympus II dialogue architecture

 Improves timing awareness: separates DM decision making and awareness into separate modules
– Improves turn-taking
– Higher success after barge-in
– Reduces system cut-ins
 The international spoken dialogue community uses Olympus II for different systems
– About 10 different applications
» Bus, air reservations, room reservations, movie line, personal minder, etc.
 Open source with documentation
Olympus II architecture

[Figure: Olympus II architecture diagram]
Speech synthesis

 Recognizer = Sphinx-II (enhanced, customized)
 Synthesizer = Festival
– Well-known, open source
– Provides very high-quality limited-domain synthesis
» Some people calling Let’s Go think they are talking to a person
 Rosetta spoken language generation
– Generates natural language sentences using templates
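Template-based generation in the spirit of Rosetta can be sketched in a few lines. The dialogue acts, slots, and strings below are invented for illustration (the actual Rosetta templates are not shown in this talk); the confirmation template echoes the example on the next slide:

```python
# Hedged sketch of template-based NLG: fill slot values into per-act
# templates. Acts and strings are invented examples, not Rosetta's own.
templates = {
    "confirm_place": "Going to {place}, at what time?",
    "next_bus": "The next {route} leaves {stop} at {time}.",
}

def generate(act, **slots):
    """Fill the template for a dialogue act with the given slot values."""
    return templates[act].format(**slots)

print(generate("confirm_place", place="the airport"))
print(generate("next_bus", route="28X", stop="Forbes and Murray", time="6:45 PM"))
```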
Spoken Dialogue and Language Learning

 Correction at any of several levels of language within a dialogue, while keeping the dialogue on track
– “I want to go the airport”
– “Going to the airport, at what time?”
 Adaptation to non-native speech
 Uses F0 unit selection (Raux & Black, ASRU 2003) for emphasis in synthesis
 Lexical entrainment for rapid correction
– Implicit strategy, no break in dialogue
Spoken Dialogue and Language Learning

 Lexical entrainment for rapid correction
 Collect speech in a WOZ setup
 Model expected error
 Find closest user utterance

[Diagram: the ASR hypothesis is aligned (DP-based alignment) against stored target prompts; prompt selection with emphasis yields the confirmation prompt]