Spoken and Multimodal Dialog Technology and Systems

Speech Recognition and
Understanding
Alex Acero
Microsoft Research
Thanks to Mazin Rahim (AT&T)
“A Vision into the 21st Century”
Milestones in Speech Recognition

[Timeline figure, reconstructed; the year axis runs 1962-2003:
• 1962: Small vocabulary, acoustic phonetics-based. Tasks: isolated words.
  Technologies: filter-bank analysis, time normalization, dynamic programming.
• 1972: Medium vocabulary, template-based. Tasks: isolated words, connected
  digits, continuous speech. Technologies: pattern recognition, LPC analysis,
  clustering algorithms, level building.
• 1982: Large vocabulary, statistical-based. Tasks: connected words,
  continuous speech. Technologies: hidden Markov models, stochastic language
  modeling.
• 1992: Very large vocabulary; syntax, semantics. Tasks: continuous speech,
  speech understanding. Technologies: stochastic language understanding,
  finite-state machines, statistical learning.
• 2003: Very large vocabulary; semantics, multimodal dialog. Tasks: spoken
  dialog, multiple modalities. Technologies: concatenative synthesis,
  machine learning, mixed-initiative dialog.]
Multimodal System
Technology Components

[Block diagram: speech, pen, gesture, and visual inputs enter the system.
ASR (Automatic Speech Recognition): speech -> words.
SLU (Spoken Language Understanding): words -> meaning.
DM (Dialog Management): meaning -> action.
SLG (Spoken Language Generation): action -> words.
TTS (Text-to-Speech Synthesis): words -> speech.
Data and rules drive each component.]
Voice-enabled System
Technology Components

[Block diagram: the same pipeline with speech as the only input modality:
speech -> ASR -> words -> SLU -> meaning -> DM -> action -> SLG -> words ->
TTS -> speech, with data and rules driving each component.]
Automatic Speech Recognition
• Goal
  – Convert a speech signal into a text message
  – Accurately and efficiently
  – Independent of the device, the speaker, or the environment
• Applications
  – Accessibility
  – Eyes-busy, hands-busy settings (automobile, doctors, etc.)
  – Call centers for customer care
  – Dictation
Basic Formulation
• The basic equation of speech recognition is

  \hat{W} = \arg\max_W P(W \mid X) = \arg\max_W \frac{P(W) P(X \mid W)}{P(X)} = \arg\max_W P(W) P(X \mid W)

• X = X_1, X_2, ..., X_n is the acoustic observation
• \hat{W} = w_1 w_2 ... w_m is the word sequence
• P(X|W) is the acoustic model
• P(W) is the language model
Speech Recognition
Process

[Block diagram: input speech -> feature extraction -> pattern classification
(decoding, search) -> confidence scoring -> output, e.g. "Hello World"
(0.9) (0.8). The decoder draws on the acoustic model, the language model,
and the word lexicon.]
Feature Extraction

Goal:
• Extract robust features relevant for ASR
Method:
• Spectral analysis
Result:
• A feature vector every 10 ms
Challenges:
• Robustness to environment (office, airport, car)
• Devices (speakerphones, cellphones)
• Speakers (accents, dialects, style, speaking defects)
Spectral Analysis
• Female speech (/aa/, pitch of 200 Hz)
• Fourier transform:

  X[k] = \sum_{n=0}^{N-1} x[n] e^{-j 2\pi nk/N}

  where x[n] is the time signal and X[k] its Fourier transform
• 30 ms Hamming window
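A small numpy sketch of this analysis: one 30 ms Hamming-windowed frame and
its DFT. The 8 kHz sampling rate and the pure 200 Hz sinusoid standing in
for the /aa/ vowel are illustrative assumptions.

import numpy as np

fs = 8000                                # assumed sampling rate (Hz)
t = np.arange(int(0.030 * fs)) / fs      # one 30 ms analysis frame
x = np.sin(2 * np.pi * 200 * t)          # stand-in for /aa/ at 200 Hz pitch

w = np.hamming(len(x))                   # 30 ms Hamming window
X = np.fft.rfft(x * w, n=512)            # X[k] = sum_n x[n] e^{-j 2*pi*nk/N}
magnitude_db = 20 * np.log10(np.abs(X) + 1e-10)
freqs = np.fft.rfftfreq(512, d=1 / fs)
print(freqs[np.argmax(magnitude_db)])    # spectral peak near 200 Hz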
Spectrograms
• Short-time Fourier transform
• Pitch and formant structure
[Figure: waveform and spectrogram of an utterance; time axis 0-0.6 seconds,
frequency axis 0-4000 Hz.]
Feature Extraction Process

[Processing pipeline:
s(t) -> quantization, noise removal, normalization -> s(n)
-> preemphasis -> s'(n)
-> segmentation into frames (blocking, frame size M, shift N; energy and
   zero-crossing measures) -> s'_m(n)
-> windowing W(n) -> s''_m(n)
-> spectral analysis -> a_m(l)
-> cepstral analysis -> c_m(l), plus filtering for pitch and formants
-> equalization (bias removal or normalization) -> c'_m(l)
-> temporal derivatives (delta and delta-delta cepstrum) -> Δc'_m(l)]
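A compact sketch of this front end in Python/numpy, under stated assumptions:
no noise removal or pitch/formant filtering, a plain real cepstrum instead of
mel filtering, and illustrative parameter choices (8 kHz, 30 ms frames,
10 ms shift, 13 cepstra).

import numpy as np

def front_end(s, fs=8000, frame_ms=30, shift_ms=10, n_ceps=13):
    # Preemphasis: s'[n] = s[n] - 0.97 s[n-1]
    s = np.append(s[0], s[1:] - 0.97 * s[:-1])
    M, N = int(fs * frame_ms / 1000), int(fs * shift_ms / 1000)
    frames = [s[i:i + M] for i in range(0, len(s) - M + 1, N)]
    window = np.hamming(M)
    feats = []
    for f in frames:
        spec = np.abs(np.fft.rfft(f * window))       # spectral analysis
        log_spec = np.log(spec + 1e-10)
        cep = np.fft.irfft(log_spec)[:n_ceps]        # real cepstrum
        feats.append(cep)
    c = np.array(feats)
    c -= c.mean(axis=0)                  # bias removal (cepstral mean norm.)
    delta = np.gradient(c, axis=0)       # temporal derivative (delta cepstrum)
    delta2 = np.gradient(delta, axis=0)  # delta-delta cepstrum
    return np.hstack([c, delta, delta2]) # one feature vector per 10 ms

feats = front_end(np.random.randn(8000))  # 1 s of noise as a stand-in signal
print(feats.shape)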
Robust Speech Recognition
• A mismatch in the speech signal between the training phase and the testing
  phase results in performance degradation.
• Compensation can be applied at three levels between training and testing:
  – Signal: enhancement
  – Features: normalization
  – Model: adaptation
Noise and Channel Distortion
• Distortion model: y(t) = [s(t) + n(t)] * h(t), where s(t) is the speech,
  n(t) the additive noise, h(t) the channel, and y(t) the distorted speech.
[Figure: Fourier spectra of the speech before and after distortion
(50 dB vs. 5 dB).]
Speaker Variations
• Vocal tract length varies from 15-20 cm
• Longer vocal tracts => lower frequency content
• Maximum likelihood speaker normalization
  – Warp the frequency axis of the signal by a factor α chosen to maximize
    the likelihood of the warped observations:

    \hat{\alpha} = \arg\max_{\alpha} \Pr(X^{\alpha} \mid \lambda), \quad 0.87 \le \alpha \le 1.17
Acoustic Modeling

Goal:
• Map acoustic features into distinct subword units,
  such as phones, syllables, words, etc.
Hidden Markov Model (HMM):
• Spectral properties modeled by a parametric random process
• A collection of HMMs is associated with each subword unit
• HMMs are also assigned for modeling extraneous events
Advantages:
• Powerful statistical method for a wide range of data and conditions
• Highly reliable for recognizing speech
Discrete-Time Markov Process
• Example: the Dow Jones Industrial Average, with states
  1 = up, 2 = down, 3 = unchanged
• A discrete-time, first-order Markov chain with transition matrix

  A = [a_ij] = | 0.6  0.2  0.2 |
               | 0.5  0.3  0.2 |
               | 0.4  0.1  0.5 |
Hidden Markov Models
• Now the market state is hidden: 1 = bull, 2 = bear, 3 = steady, with the
  same transition matrix A as above and initial state probabilities

  π = (0.5, 0.2, 0.3)

• Each state emits an observation (up, down, unchanged) with output pdf:

  state 1 (bull):   P(up) = 0.7, P(down) = 0.1, P(unchanged) = 0.2
  state 2 (bear):   P(up) = 0.1, P(down) = 0.6, P(unchanged) = 0.3
  state 3 (steady): P(up) = 0.3, P(down) = 0.3, P(unchanged) = 0.4
Example
• I observe (up, up, down, up, unchanged, up)
• Is it a bull market? A bear market?

  P(bull)   = 0.7·0.7·0.1·0.7·0.2·0.7 · [0.5 · (0.6)^5] = 1.867×10^-4
  P(bear)   = 0.1·0.1·0.6·0.1·0.3·0.1 · [0.2 · (0.3)^5] = 8.748×10^-9
  P(steady) = 0.3·0.3·0.3·0.3·0.4·0.3 · [0.3 · (0.5)^5] = 9.1125×10^-6

• It's 20 times more likely that we are in a bull market than a steady market!
• How about a mixed state sequence?

  P(bull, bull, bear, bull, steady, bull)
  = (0.7·0.7·0.6·0.7·0.4·0.7) · (0.5·0.6·0.2·0.5·0.2·0.4) = 1.383×10^-4
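A small Python check of these numbers, assuming the transition matrix, output
pdfs, and initial probabilities reconstructed on the previous slides.

A  = {('bull','bull'): 0.6, ('bull','bear'): 0.2, ('bull','steady'): 0.2,
      ('bear','bull'): 0.5, ('bear','bear'): 0.3, ('bear','steady'): 0.2,
      ('steady','bull'): 0.4, ('steady','bear'): 0.1, ('steady','steady'): 0.5}
B  = {'bull':   {'up': 0.7, 'down': 0.1, 'unch': 0.2},
      'bear':   {'up': 0.1, 'down': 0.6, 'unch': 0.3},
      'steady': {'up': 0.3, 'down': 0.3, 'unch': 0.4}}
pi = {'bull': 0.5, 'bear': 0.2, 'steady': 0.3}

def path_prob(states, obs):
    # P(states, obs) = pi(s1) b(s1,o1) * prod_t a(s_{t-1},s_t) b(s_t,o_t)
    p = pi[states[0]] * B[states[0]][obs[0]]
    for prev, cur, o in zip(states, states[1:], obs[1:]):
        p *= A[(prev, cur)] * B[cur][o]
    return p

obs = ['up', 'up', 'down', 'up', 'unch', 'up']
print(path_prob(['bull'] * 6, obs))    # ~1.867e-4
print(path_prob(['bear'] * 6, obs))    # ~8.748e-9
print(path_prob(['bull','bull','bear','bull','steady','bull'], obs))  # ~1.383e-4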
Basic Problems in HMMs
Given acoustic observation X and model Φ:
• Evaluation: compute P(X | Φ)
• Decoding: choose the optimal state sequence
• Re-estimation: adjust Φ to maximize P(X | Φ)

Evaluation:
Forward-Backward algorithm

Forward:
  \alpha_t(i) = P(X_1^t, s_t = i \mid \Phi)
              = \left[ \sum_{j=1}^{N} \alpha_{t-1}(j) a_{ji} \right] b_i(X_t),
  \quad 2 \le t \le T, \; 1 \le i \le N

Backward:
  \beta_t(j) = P(X_{t+1}^T \mid s_t = j, \Phi)
             = \sum_{i=1}^{N} a_{ji} b_i(X_{t+1}) \beta_{t+1}(i),
  \quad t = T-1, \ldots, 1; \; 1 \le j \le N

Transition posterior:
  \xi_t(i, j) = P(s_{t-1} = i, s_t = j \mid X_1^T, \Phi)
              = \frac{P(s_{t-1} = i, s_t = j, X_1^T \mid \Phi)}{P(X_1^T \mid \Phi)}
              = \frac{\alpha_{t-1}(i) a_{ij} b_j(X_t) \beta_t(j)}{\sum_{k=1}^{N} \alpha_T(k)}

[Figure: trellis diagrams of the forward and backward recursions over
states s_1 ... s_N at times t-2, t-1, t, t+1.]
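A minimal numpy sketch of the forward pass, reusing the Dow Jones HMM from
the earlier slides (states bull/bear/steady; symbols up=0, down=1,
unchanged=2). Illustrative only: no scaling, so it underflows on long inputs.

import numpy as np

def forward(A, B, pi, obs):
    # alpha[t, i] = P(X_1..X_t, s_t = i | model)
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha   # P(X | model) = alpha[-1].sum()

A  = np.array([[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]])
B  = np.array([[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]])
pi = np.array([0.5, 0.2, 0.3])
print(forward(A, B, pi, [0, 0, 1, 0, 2, 0])[-1].sum())   # P(X | model)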
Decoding: Viterbi Algorithm
Step 1: Initialization
  V_1(i) = π_i b_i(x_1), B_1(i) = 0, i = 1, ..., N
Step 2: Iterations
  for t = 2, ..., T:
    for j = 1, ..., N:
      V_t(j) = max_i [V_{t-1}(i) a_ij] b_j(x_t)
      B_t(j) = argmax_i [V_{t-1}(i) a_ij]
Step 3: Backtracking
  The optimal score is V_T = max_i V_T(i)
  The final state is s_T = argmax_i V_T(i)
  The optimal path is (s_1, s_2, ..., s_T), where
    s_t = B_{t+1}(s_{t+1}), t = T-1, ..., 1

[Figure: trellis showing the optimal alignment between X and S; each node
V_t(j) is reached from V_{t-1}(j-1), V_{t-1}(j), or V_{t-1}(j+1) via
a_ij and b_j(x_t).]
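A sketch of these three steps in Python, done in the log domain for numerical
stability; the Dow Jones HMM serves again as test data.

import numpy as np

def viterbi(A, B, pi, obs):
    T, N = len(obs), len(pi)
    V = np.full((T, N), -np.inf)          # best log score ending in state j
    Bp = np.zeros((T, N), dtype=int)      # backpointers B_t(j)
    V[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = V[t - 1][:, None] + np.log(A)   # V_{t-1}(i) + log a_ij
        Bp[t] = scores.argmax(axis=0)
        V[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    # Backtracking from the best final state
    path = [int(V[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(Bp[t][path[-1]]))
    return path[::-1], V[-1].max()

A  = np.array([[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]])
B  = np.array([[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]])
pi = np.array([0.5, 0.2, 0.3])
print(viterbi(A, B, pi, [0, 0, 1, 0, 2, 0]))   # likely the all-"bull" path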
Reestimation:
Baum-Welch Algorithm
• Find Φ = (A, B, π) that maximizes P(X | Φ):

  P(X \mid \Phi) = \sum_S P(X, S \mid \Phi) = \sum_S \prod_{t=1}^{T} a_{s_{t-1} s_t} b_{s_t}(X_t)

• No closed-form solution =>
• EM algorithm:
  – Start with the old parameter value Φ
  – Obtain a new parameter \hat{\Phi} that maximizes the auxiliary function

  Q(\Phi, \hat{\Phi}) = \sum_{\text{all } S} \frac{P(X, S \mid \Phi)}{P(X \mid \Phi)} \log P(X, S \mid \hat{\Phi})
                      = \sum_i Q_{a_i}(\Phi, \hat{a}_i) + \sum_j Q_{b_j}(\Phi, \hat{b}_j)

• EM is guaranteed not to decrease the likelihood
Continuous Densities
• Output distribution is a mixture of Gaussians:

  b_j(\mathbf{x}) = \sum_{k=1}^{M} c_{jk} N(\mathbf{x}; \mu_{jk}, \Sigma_{jk}) = \sum_{k=1}^{M} c_{jk} b_{jk}(\mathbf{x})

• Posterior probability of the transition from state i at time t-1 to state
  j at time t, and of mixture k in state j at time t:

  \xi_t(i, j) = \frac{\alpha_{t-1}(i) a_{ij} b_j(X_t) \beta_t(j)}{\sum_{m=1}^{N} \alpha_T(m)}

  \zeta_t(j, k) = \frac{\left[ \sum_{m=1}^{N} \alpha_{t-1}(m) a_{mj} \right] c_{jk} b_{jk}(\mathbf{x}_t) \beta_t(j)}{\sum_{m=1}^{N} \alpha_T(m)}

• Reestimation formulae:

  \hat{a}_{ij} = \frac{\sum_{t=1}^{T} \xi_t(i, j)}{\sum_{t=1}^{T} \sum_{k=1}^{N} \xi_t(i, k)}

  \hat{c}_{jk} = \frac{\sum_{t=1}^{T} \zeta_t(j, k)}{\sum_{t=1}^{T} \sum_{k} \zeta_t(j, k)}

  \hat{\mu}_{jk} = \frac{\sum_{t=1}^{T} \zeta_t(j, k) \mathbf{x}_t}{\sum_{t=1}^{T} \zeta_t(j, k)}

  \hat{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} \zeta_t(j, k) (\mathbf{x}_t - \hat{\mu}_{jk})(\mathbf{x}_t - \hat{\mu}_{jk})^T}{\sum_{t=1}^{T} \zeta_t(j, k)}
EM Training Procedure

[Training loop: input speech database + old HMM model
-> estimate posteriors ζ_t(j,k), ξ_t(i,j)
-> maximize parameters a_ij, c_jk, μ_jk, Σ_jk
-> updated HMM model, which becomes the old model for the next iteration.]
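One possible sketch of this loop in Python, under simplifying assumptions:
a discrete-output HMM rather than Gaussian mixtures, a single short
observation sequence, and no scaling (so it is unusable on long inputs).

import numpy as np

def baum_welch(A, B, pi, obs, iters=5):
    obs = np.asarray(obs)
    T, N = len(obs), len(pi)
    for _ in range(iters):
        # E-step: forward/backward passes give the posteriors
        alpha = np.zeros((T, N)); beta = np.ones((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        px = alpha[-1].sum()
        gamma = alpha * beta / px                      # state posteriors
        xi = (alpha[:-1, :, None] * A[None, :, :] *
              (B[:, obs[1:]].T * beta[1:])[:, None, :]) / px
        # M-step: renormalize the expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(B.shape[1]):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B = B / gamma.sum(axis=0)[:, None]
    return A, B, pi

A  = np.array([[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]])
B  = np.array([[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]])
pi = np.array([0.5, 0.2, 0.3])
A, B, pi = baum_welch(A, B, pi, [0, 0, 1, 0, 2, 0])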
Design Issues
• Continuous vs. discrete HMM
• Whole-word vs. subword (phone) units
• Number of states, number of Gaussians
• Ergodic vs. Bakis (left-to-right) topology
• Context-dependent vs. context-independent units
[Figure: three-state left-to-right HMM topology with states 0, 1, 2.]
Training with continuous
speech
• No segmentation is needed
• Composed HMM: concatenate the word and silence models along the
  transcription, e.g. /sil/ one /sil/ three /sil/
Context Variability in Speech
• At word/sentence level:
  "Mr. Wright should write to Ms. Wright right away about his Ford or four
  door Honda."
• At phone level:
  – /iy/ differs in the words peat and wheel
  – Coarticulation
• Triphones capture:
  – phonetic context
  – coarticulation
[Figure: waveforms and spectrograms (0-4000 Hz, 0-0.3 seconds) of "peat"
and "wheel", showing the different realizations of /iy/.]
Context-dependent models
• The triphone IY(P, CH) captures coarticulation and phonetic context
• Stress also matters: Italy vs. Italian
[Figure: waveforms and spectrograms (0-4000 Hz) contrasting the first vowel
in "Italy" and "Italian".]
Clustering similar triphones
• /iy/ with two different left contexts, /r/ and /w/
• The two contexts have similar effects on /iy/
• Cluster those triphones together
[Figure: waveforms and spectrograms (0-4000 Hz, 0-0.4 seconds) of /iy/ with
left contexts /r/ and /w/, showing similar spectra.]
Clustering with decision trees
Example: a phone in "welcome". Each node asks a phonetic question and each
leaf is a tied state (senone):
• Is left phone a sonorant or nasal?
• Is right phone a back-R?
• Is left phone /s, z, sh, zh/?
• Is right phone voiced?
• Is left phone a back-L, or (is left phone neither a nasal nor a Y-glide
  and right phone a LAX-vowel)?
The leaves are senones 1-6.
Other variability in Speech
• Style:
  – discrete vs. continuous speech
  – read vs. spontaneous
  – slow vs. fast
• Speaker:
  – speaker-independent
  – speaker-dependent
  – speaker-adapted
• Environment:
  – additive noise (cocktail party effect)
  – telephone channel
Acoustic Adaptation
• Model adaptation is needed if
  – test conditions are mismatched
  – we want to tune to a given speaker
• Maximum a Posteriori (MAP)
  – adds a prior for the parameters Φ:

    \hat{\Phi} = \arg\max_{\Phi} [p(\Phi \mid X)] = \arg\max_{\Phi} [p(X \mid \Phi) p(\Phi)]

• Maximum Likelihood Linear Regression (MLLR)
  – transforms the mean vectors: \tilde{\mu}_{ik} = W_c \mu_{ik}
  – can have more than one MLLR transform
  – Speaker Adaptive Training (SAT) applies MLLR to the training data as well
MAP vs MLLR
[Figure: adaptation performance of MAP vs. MLLR; the speaker-dependent
baseline is trained with 1000 sentences.]
Discriminative Training
• Maximum likelihood training:
  – parameters estimated from the true classes only
• Discriminative training:
  – maximizes discrimination between classes
• Discriminative feature transformation:
  – maximizes the ratio of inter-class to intra-class differences
  – done at the state level
  – Linear Discriminant Analysis (LDA)
• Discriminative model training:
  – maximizes posterior probability
  – uses the correct class and competing classes
  – Maximum Mutual Information (MMI), Minimum Classification Error (MCE),
    Minimum Phone Error (MPE)
Word Lexicon

Goal:
Map legal phone sequences into words according to phonotactic rules, e.g.:
  David -> /d/ /ey/ /v/ /ih/ /d/
Multiple pronunciations:
Several words may have multiple pronunciations, e.g.:
  Data -> /d/ /ae/ /t/ /ax/
  Data -> /d/ /ey/ /t/ /ax/
Challenges:
• How do you generate a word lexicon automatically?
• How do you add new variant dialects and word pronunciations?
The Lexicon
• An entry per word (> 100K words for dictation)
• Multiple pronunciations (tomato)
• Built by hand or with letter-to-sound (LTS) rules
• LTS rules can be automatically trained with decision trees (CART)
  – less than 8% error, but proper nouns are hard!
[Figure: pronunciation network for "tomato": /t/ -> /ax/ or /ey/ -> /m/ ->
/ey/ or /aa/ -> /t/ or /dx/ -> /ow/, with probabilities on the arcs.]
Language Model

Goal:
Model "acceptable" spoken phrases, constrained by task syntax
Rule-based:
Deterministic, knowledge-driven grammars, e.g.:
  flying from $city to $city on $date
Statistical:
Compute estimates of word probabilities (N-gram, class-based, CFG), e.g.:
  flying from Newark to Boston tomorrow
Formal grammars

Rewrite rules:
1. S -> NP VP
2. VP -> V NP
3. VP -> AUX VP
4. NP -> ART NP1
5. NP -> ADJ NP1
6. NP1 -> ADJ NP1
7. NP1 -> N
8. NP -> NAME
9. NP -> PRON
10. NAME -> Mary
11. V -> loves
12. ADJ -> that
13. N -> person

[Parse tree for "Mary loves that person": S -> NP (NAME Mary) VP (V loves,
NP (ADJ that, NP1 (N person))).]
Chomsky Grammar Hierarchy

Type                         Constraints                          Automata
Phrase-structure grammar     α -> β. This is the most general     Turing machine
                             grammar.
Context-sensitive grammar    Subset with |α| <= |β|, where |.|    Linear bounded automata
                             indicates the length of the string.
Context-free grammar (CFG)   A -> w and A -> BC, where w is a     Pushdown automata
                             terminal and B, C are non-terminals.
Regular grammar              A -> w and A -> wB                   Finite-state automata
P( W)  P( w1 , w2 ,K , wn )
 P( w1 ) P( w2 | w1 ) P( w3 | w1 , w2 )L P( wn | w1, w2 ,K , wn 1 )
n
  P(wi | w1, w2 ,K , wi 1 )
i 1
Trigram Estimation

P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}
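A small Python sketch of this maximum-likelihood estimate from raw counts;
the toy corpus and the <s>/</s> padding convention are illustrative, and no
smoothing is applied.

from collections import Counter

def trigram_mle(corpus):
    tri, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for a, b, c in zip(toks, toks[1:], toks[2:]):
            tri[(a, b, c)] += 1       # C(w_{i-2}, w_{i-1}, w_i)
            bi[(a, b)] += 1           # C(w_{i-2}, w_{i-1})
    return lambda w, h2, h1: tri[(h2, h1, w)] / bi[(h2, h1)] if bi[(h2, h1)] else 0.0

p = trigram_mle([["john", "read", "a", "book"], ["john", "read", "her", "book"]])
print(p("a", "john", "read"))   # C(john read a) / C(john read) = 1/2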
Understanding Bigrams
• Training data:
  – "John read her book"
  – "I read a different book"
  – "John read a book by Mark"

  P(John | <s>)  = C(<s>, John) / C(<s>)    = 2/3
  P(read | John) = C(John, read) / C(John)  = 2/2
  P(a | read)    = C(read, a) / C(read)     = 2/3
  P(book | a)    = C(a, book) / C(a)        = 1/2
  P(</s> | book) = C(book, </s>) / C(book)  = 2/3

  P(John read a book)
  = P(John | <s>) P(read | John) P(a | read) P(book | a) P(</s> | book) ≈ 0.148

• But we have a problem here:
  P(Mark read a book)
  = P(Mark | <s>) P(read | Mark) P(a | read) P(book | a) P(</s> | book) = 0
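The same computation as a runnable Python sketch over the three training
sentences; it reproduces the 0.148 estimate and shows the zero-count problem
that motivates smoothing.

from collections import Counter

corpus = [["john", "read", "her", "book"],
          ["i", "read", "a", "different", "book"],
          ["john", "read", "a", "book", "by", "mark"]]

bi, uni = Counter(), Counter()
for s in corpus:
    toks = ["<s>"] + s + ["</s>"]
    for a, b in zip(toks, toks[1:]):
        bi[(a, b)] += 1
        uni[a] += 1

def p(w, prev):                       # P(w | prev) = C(prev, w) / C(prev)
    return bi[(prev, w)] / uni[prev]

sent = ["<s>", "john", "read", "a", "book", "</s>"]
prob = 1.0
for prev, w in zip(sent, sent[1:]):
    prob *= p(w, prev)
print(prob)                 # ~0.148
print(bi[("<s>", "mark")])  # 0: P(mark | <s>) = 0 without smoothing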
Ngram Smoothing
• Data sparseness: in millions of words, more than 50% of trigrams occur
  only once.
• We can't assign P(w_i | w_{i-2}, w_{i-1}) = 0
• Solution: assign a non-zero probability to each ngram by lowering the
  probability mass of seen ngrams, e.g. by interpolating with a lower-order
  model:

  P_{smooth}(w_i \mid w_{i-n+1} \ldots w_{i-1})
  = \lambda P_{ML}(w_i \mid w_{i-n+1} \ldots w_{i-1})
  + (1 - \lambda) P_{smooth}(w_i \mid w_{i-n+2} \ldots w_{i-1})
Perplexity
• The cross-entropy of a language model on a word sequence W is

  H(W) = -\frac{1}{N_W} \log_2 P(W)

• and its perplexity

  PP(W) = 2^{H(W)}

  measures the complexity of a language model (the geometric mean of the
  branching factor).
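A short Python sketch of this definition. The uniform 10-word model used as
a check is an assumption made here to show that perplexity then equals the
vocabulary size, as on the next slide's digit task.

import math

def perplexity(logprob_fn, words):
    # PP(W) = 2^{H(W)}, H(W) = -(1/N_W) log2 P(W);
    # logprob_fn returns log2 of each word's conditional probability.
    H = -sum(logprob_fn(w, words[:i]) for i, w in enumerate(words)) / len(words)
    return 2 ** H

print(perplexity(lambda w, h: math.log2(1 / 10), ["one", "two", "three"]))  # 10.0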
Perplexity
• The digit recognition task (TIDIGITS) has 10 words, PP=10, and a 0.2%
  error rate
• The Airline Travel Information System (ATIS) has 2000 words, PP=20, and a
  2.5% error rate
• The Wall Street Journal task has 5000 words, PP=130 with a bigram, and a
  5% error rate
• In general, lower perplexity => lower error rate, but perplexity does not
  take acoustic confusability into account: the E-set (B, C, D, E, G, P, T)
  has PP=7 and a 5% error rate.
Ngram Smoothing
• The deleted interpolation algorithm estimates the λ that maximizes the
  probability of a held-out data set
• We can also map all out-of-vocabulary words to an unknown word token
• Add-one smoothing:

  P(w_i \mid w_{i-1}) = \frac{1 + C(w_{i-1}, w_i)}{\sum_{w_i} (1 + C(w_{i-1}, w_i))}
                      = \frac{1 + C(w_{i-1}, w_i)}{V + \sum_{w_i} C(w_{i-1}, w_i)}

• Other backoff smoothing algorithms are possible:
  Katz, Kneser-Ney, Good-Turing, class ngrams
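A sketch of the add-one formula in Python, reusing the earlier toy corpus;
it assigns "Mark read..." a small but non-zero probability. Real systems
would use one of the better smoothers named above.

from collections import Counter

def add_one_bigram(corpus):
    bi, uni = Counter(), Counter()
    vocab = set(["</s>"])
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        vocab.update(toks)
        for a, b in zip(toks, toks[1:]):
            bi[(a, b)] += 1
            uni[a] += 1
    V = len(vocab)
    # P(w | prev) = (1 + C(prev, w)) / (V + sum_w C(prev, w))
    return lambda w, prev: (1 + bi[(prev, w)]) / (V + uni[prev])

p = add_one_bigram([["john", "read", "her", "book"],
                    ["i", "read", "a", "different", "book"],
                    ["john", "read", "a", "book", "by", "mark"]])
print(p("mark", "<s>"))   # small but non-zero, unlike the MLE estimate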
Adaptive Language Models
• Cache language models:

  P_{cache}(w_i \mid w_{i-n+1} \ldots w_{i-1})
  = \lambda_c P_s(w_i \mid w_{i-n+1} \ldots w_{i-1})
  + (1 - \lambda_c) P_{cache}(w_i \mid w_{i-2} w_{i-1})

• Topic language models
• Maximum entropy language models
Bigram Perplexity
[Chart: bigram perplexity vs. vocabulary size (10k-60k). Model trained on
500 million words and tested on the Encarta Encyclopedia.]

OOV Rate
[Chart: out-of-vocabulary rate vs. vocabulary size (10k-60k), measured on
the Encarta Encyclopedia; model trained on 500 million words.]
WSJ Results
• Perplexity and word error rate on the 60,000-word Wall Street Journal
  continuous speech recognition task
• Unigrams, bigrams, and trigrams were trained from 260 million words
• Kneser-Ney smoothing

  Model     Perplexity   Word Error Rate
  Unigram   1200         14.9%
  Bigram    176          11.3%
  Trigram   91           9.6%
Pattern Classification

Goal:
Find the "optimal" word sequence by combining information (probabilities)
from the
• Acoustic model
• Word lexicon
• Language model
Method:
The decoder searches through all possible recognition choices using a
Viterbi decoding algorithm
Challenge:
Efficient search through a large network space is computationally expensive
for large-vocabulary ASR
The Problem of Large
Vocabulary ASR
• The basic problem in ASR is to find the sequence of words that explains
  the input signal. This implies the following mapping:
  features -> HMM states -> HMM units -> phones -> words -> sentences
  For the WSJ 20K-vocabulary task, this results in a network of 10^22 bytes!
• State-of-the-art methods, including fast match, multi-pass decoding,
  A* stack decoding, and finite-state transducers, all provide tremendous
  speed-ups by searching through the network and finding the best path that
  maximizes the likelihood function.
Weighted Finite State
Transducers (WFST)
• A unified mathematical framework for ASR
• Efficiency in time and space
• The recognition cascade is built by composing transducers:
  word:phrase WFST ∘ phone:word WFST ∘ HMM:phone WFST ∘ state:HMM WFST;
  composition followed by optimization yields the search network.
• WFST compilation can reduce the network to 10^8 states,
  14 orders of magnitude more efficient.
Confidence Scoring

Goal:
Identify possible recognition errors and out-of-vocabulary events.
Potentially improves the performance of ASR, SLU, and DM.
Method:
A confidence score, based on a hypothesis likelihood ratio test, is
associated with each recognized word:

  P(W \mid X) = \frac{P(W) P(X \mid W)}{\sum_W P(W) P(X \mid W)}

  Label:      credit please
  Recognized: credit fees
  Confidence: (0.9) (0.3)

Challenges:
Rejection of extraneous acoustic events (noise, background speech, door
slams) without rejection of valid user input speech.
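A minimal sketch of the posterior normalization above, applied to an n-best
list. The hypotheses and their total log scores are hypothetical; real
systems typically compute such posteriors over lattices rather than n-best
lists.

import math

def confidences(nbest):
    # Normalize P(W)P(X|W) over the candidate set, in the log domain.
    m = max(nbest.values())
    exp = {w: math.exp(s - m) for w, s in nbest.items()}   # avoid underflow
    z = sum(exp.values())
    return {w: v / z for w, v in exp.items()}

print(confidences({"credit please": -120.1, "credit fees": -121.3}))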
Speech Recognition
Process

[Architecture diagram repeated from earlier: input speech -> feature
extraction -> pattern classification (decoding, search) -> confidence
scoring -> "Hello World" (0.9) (0.8), drawing on the acoustic model, the
language model, and the word lexicon.]
How to evaluate performance?
• Dictation applications: insertions, substitutions, and deletions

  \text{Word Error Rate} = 100\% \times \frac{\text{Subs} + \text{Dels} + \text{Ins}}{\text{No. of words in the correct sentence}}

• Command-and-control: false rejections and false acceptances => ROC curves.
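A Python sketch of the word error rate via the standard minimum edit
distance between the reference and the hypothesis, using the confidence
slide's "credit please" / "credit fees" pair as a worked example.

def word_error_rate(ref, hyp):
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[-1][-1] / len(ref)

print(word_error_rate("credit please".split(), "credit fees".split()))  # 50.0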
ASR Performance: The State-of-the-art

CORPUS (DARPA)                 TYPE OF SPEECH             VOCABULARY SIZE   WORD ERROR RATE
Connected Digit Strings        Read speech                10                0.3%
Airline Travel Information     Spontaneous                2,500             2.5%
North American Business News   Read speech                64,000            6.6%
Broadcast News                 News shows                 210,000           13-17%
Switchboard                    Conversational telephone   45,000            25-29%
Growth in Effective
Recognition Vocabulary Size
[Chart: effective recognition vocabulary size (log scale, 10 to 10,000,000
words) vs. year, 1960-2010.]
Improvement in Word
Accuracy
[Chart: word accuracy (30-80%) vs. year (1996-2002) on Switchboard/Call
Home, showing a 10-20% relative error reduction per year.
Vocabulary: 40,000 words; perplexity: 85.]
Human Speech Recognition
vs ASR
[Chart: machine error (%) vs. human error (%) on a log-log scale, with x1,
x10, and x100 reference lines, for the tasks Digits, RM-null, RM-LM, WSJ,
WSJ-22dB, NAB-mic, NAB-omni, and SWBD. The region where machines would
outperform humans, below the x1 line, is empty.]
Challenges in ASR
System Performance
- Accuracy
- Efficiency (speed, memory)
- Robustness
Operational Performance
- End-point detection
- User barge-in
- Utterance rejection
- Confidence scoring
Machines are 10-100 times less accurate than humans
Large-Vocabulary ASR Demo
Multimedia Customer Care
Courtesy of AT&T
Voice-enabled System
Technology Components

[Block diagram as before: speech -> ASR -> words -> SLU -> meaning -> DM ->
action -> SLG -> words -> TTS -> speech; data and rules drive each
component.]
Spoken Language
Understanding (SLU)
• Goal: Extract and interpret the meaning of the recognized speech so as to
  identify the user's request.
  – Accurate understanding can often be achieved without correctly
    recognizing every word.
  – SLU makes it possible to offer natural-language based services where
    the customer can speak "openly" without learning a specific set of
    terms.
• Methodology: Exploit task grammar (syntax) and task semantics to restrict
  the range of meanings associated with the recognized word string.
• Applications: Automation of complex operator-based tasks, e.g., customer
  care, customer help lines, etc.
SLU Formulation
• Let W be a sequence of words and C be its underlying meaning (conceptual
  structure). Then, using Bayes' rule:

  P(C \mid W) = P(W \mid C) P(C) / P(W)

• Finding the best conceptual structure can be done by parsing and ranking
  using a combination of acoustic, linguistic, and semantic scores:

  C^* = \arg\max_C P(W \mid C) P(C)
Knowledge Sources for Speech Understanding

ASR:
• Acoustic/phonetic (acoustic model): relationship of speech sounds and
  English phonemes
• Phonotactic (word lexicon): rules for phoneme sequences and pronunciation
SLU:
• Syntactic (language model): structure of words and phrases in a sentence
• Semantic (understanding model): relationship of meanings among words
DM:
• Pragmatic (dialog manager): discourse, interaction history, world
  knowledge
SLU Components

From ASR (text, lattices, n-best, history)
-> Text Normalization: morphology, synonyms
-> Parsing/Decoding: extracting named entities, semantic concepts, the
   syntactic tree
-> Interpretation: slot filling, reasoning, task knowledge representation,
   with database access
To DM (concepts, entities, parse tree)
Text Normalization
Goal: Reduce language variation (lexical analysis)
• Morphology
  – decomposing words into their "minimal units of meaning" (grammatical
    analysis)
• Synonyms
  – finding words that mean the same (hello, hi, how do you do)
• Equalization
  – disfluencies, non-alphanumeric characters, capitals, non-white-space
    characters, etc.
Parsing/Decoding
Goal: Map the textual representation into semantic concepts using knowledge
rules and/or data.
• Entity extraction (matcher)
  – finite-state machines (FSM)
  – context-free grammars (CFG)
• Semantic parsing
  – tree-structured meaning representation allowing arbitrary nesting of
    concepts
• Decoding
  – segment an utterance into phrases, each representing a concept (e.g.,
    using HMMs)
• Classification
  – categorizing an utterance into one or more semantic concepts
Interpretation
Goal: Interpret the user's utterance in the context of the dialog.
• Ellipsis and anaphora
  – "What about for New York?"
• History mechanism
  – communicated messages are differential
  – removing ambiguities
• Semantic frames (slots)
  – associating entities and relations (attribute/value pairs)
  – rule-based
• Database look-up
  – retrieving or checking entities
DARPA Communicator
• DARPA-sponsored research and development of mixed-initiative dialogue
  systems
• Travel task involving airline, hotel, and car information and reservation:
  "Yeah I uh I would like to go from New York to Boston tomorrow night with
  United"
• SLU output (concept decoding):
  Topic: Itinerary
  Origin: New York
  Destination: Boston
  Day of the week: Sunday
  Date: May 25th, 2002
  Time: >6pm
  Airline: United
XML Schema
<itinerary>
<origin>
<city></city>
<state></state>
</origin>
<destination>
<city></city>
<state></state>
</destination>
<date></date>
<time></time>
<airline></airline>
</itinerary>
Semantic CFG
<rule name="itinerary">
  <o>Show me flights</o> <ruleref name="origin"/>
  <ruleref name="destination"/>
</rule>
<rule name="origin">
  from <ruleref name="city"/>
</rule>
<rule name="destination">
  to <ruleref name="city"/>
</rule>
<rule name="city">
  New York | San Francisco | Boston
</rule>
AT&T How May I Help You?
- Customer Care Services 

User responds with unconstrained fluent speech
System recognizes salient phrases and determines meaning of
users speech, then routes the call
“There is a number on my bill I didn’t make” – unrecognized number
How May I help You?
Unrecognized
number
billing
credit
Account
balance
Agent
Combined bill
Voice-enabled System
Technology Components

[Block diagram as before: speech -> ASR -> words -> SLU -> meaning -> DM ->
action -> SLG -> words -> TTS -> speech; data and rules drive each
component.]
Dialog Management (DM)
• Goal:
– Manage elaborate exchanges with the user
– Provide automated access to information
• Implementation:
– Mathematical model programmed from rules
and/or data
Computational Models for DM
• Structure-based approach
  – Assumes that there is a regular structure in dialog that can be modeled
    as a state-transition network or dialog grammar
  – Dialog strategies need to be predefined, which limits flexibility
  – Several real-time deployed systems exist today that have inference
    engines and knowledge representation
• Plan-based approach
  – Considers communication as "acting": dialog acts and actions are
    oriented toward goal achievement
  – Motivated by human/human interaction, in which humans generally have
    goals and plans when interacting with others
  – Aims to account for general models and theories of discourse
DM Technology
Components
• Context interpretation
• Dialog strategies (modules)
• Backend (database access)
Context Interpretation
[Diagram: user input plus the current context state(t) produce an action
and a new context state(t+1).]
• A formal representation of the context history is necessary for the DM to
  interpret a user's utterance given previously exchanged utterances and to
  identify a new action.
• Natural communication is a differential process.
Dialog Strategies
• Completion (continuation): elicits missing input information from the user
• Constraining (disambiguation): reduces the scope of the request when
  multiple pieces of information have been retrieved
• Relaxation: increases the scope of the request when no information has
  been retrieved
• Confirmation: verifies that the system understood correctly; the strategy
  may differ depending on the SLU confidence measure
• Reprompting: used when the system expected input but did not receive any,
  or did not understand what it received
• Context help: provides the user with help during periods of
  misunderstanding by either the system or the user
• Greeting/closing: maintains social protocol at the beginning and end of
  an interaction
• Mixed-initiative: allows users to manage the dialog
Mixed-initiative Dialog
Who manages the dialog?
[Initiative spectrum: at the system-initiative end, "Please say collect,
calling card, third number."; at the user-initiative end, "How can I help
you?"]
Example: Mixed-Initiative
Dialog
• System initiative (long dialogs, but easier to design)
  System: Please say just your departure city.
  User: Chicago
  System: Please say just your arrival city.
  User: Newark
• Mixed initiative
  System: Please say your departure city.
  User: I need to travel from Chicago to Newark tomorrow.
• User initiative (shorter dialogs and a better user experience, but more
  difficult to design)
  System: How may I help you?
  User: I need to travel from Chicago to Newark tomorrow.
Dialog Evaluation
Why?
• Identify problematic situations (task failure)
• Minimize user hang-ups and routing to an operator
• Improve user experience
• Perform data analysis and monitoring
• Conduct data mining for sales and marketing
How?
• Log exchange features from ASR, SLU, DM, ANI
• Define and automatically optimize a task objective function
• Compute service statistics (instant and delayed)
Optimizing Task Objective
Function
User Satisfaction is the ultimate objective of a
dialog system
• Task completion rate
• Efficiency of the dialog interaction
• Usability of the system
• Perceived intelligibility of the system
• Quality of the audio output
• Perplexity and quality of the response generation
Machine learning techniques can be applied to predict
user satisfaction (e.g., Paradise framework)
Example of Plan-based Dialog
Systems: TRIPS
[TRIPS system architecture: speech and keyboard input go to a parser; an
interpretation manager consults reference resolution and the discourse
context; problem-solving and task managers, a planner, a scheduler,
monitors, and events drive a behavioral agent; a generation manager
produces speech and display output via text and display generators.]
Help Desk: Example of Structure-based Dialog Systems
- AT&T Labs Natural Voices™
• A finite-state engine is used to control the action of the interpreter
  output.
• The system routes customers to appropriate agents or departments.
• Provides information about products, services, costs, and the business.
• Shows demonstrations of various voice fonts and languages.
• Troubleshoots problems and concerns raised by customers (in progress).
Voice-enabled System
Technology Components

[Block diagram as before, annotated: ASR turns speech into the words
spoken; SLU extracts their meaning; DM maps meaning to an action; SLG
produces the words to be synthesized; TTS renders them as speech.]
A Good User Interface
• Makes the application easy to use
• Makes the application robust to the kinds of confusion that arise in
  human-machine communication by voice
• Keeps the conversation moving forward, even in periods of great
  uncertainty on the part of either the user or the machine
• A great UI cannot save a system with poor ASR and NLU; but the UI can
  make or break a system, even with excellent ASR and NLU
• Effective UI design is based on a set of elementary principles: common
  widgets, sequenced screen presentations, simple error-trap dialogs, and a
  user manual.
Correction UI
• N-best list
Multimodal System
Technology Components

[Block diagram as before, with speech, pen, gesture, and visual inputs
feeding ASR/SLU, and DM, SLG, and TTS completing the loop.]
Multimodal Experience
• Access to information
anytime and anywhere using
any device
• “Most appropriate” UI mode
or combination of modes
depends on the device, the
task, the environment, and
the user’s abilities and
preferences.
Multimodal Architecture
[Processing loop:
x (multimodal inputs) -> Parsing, using a language model of what the user
might do or say, P(F | x, S_{n-1}) -> F (surface semantics) ->
Understanding, using a semantic model of what that means, P(S_n | F, S_{n-1})
-> S_n (discourse semantics) -> application logic -> Rendering, using a
behavior model of what to show or say to the user, P(A | S_n) ->
A (multimedia outputs).]
MiPad
• Multimodal Interactive Pad
• Usability studies show double the throughput for English
• Speech is mostly useful in cases with lots of alternatives
MiPad video
Speech-enabled MapPoint
Other Research Areas
not covered
• Speaker recognition (verification and identification)
• Language identification
• Human/Human and Human/Machine translation
• Multimedia and document retrieval
• Speech coding
• Microphone array processing
• Multilingual speech recognition
Application Development
Speech APIs
• Open-standard APIs separate the ASR/TTS engines and platform from the
  application layer.
• The application is engine-independent and contains the SLU, DM, content
  server, and host interface.
[Diagram: the application sits on an API layer over the TTS engine (audio
destination object) and the ASR engine (audio source object).]
Voice XML Architecture
[Diagram: a web server delivers multimedia, HTML, scripts, VoiceXML, and
audio/grammars to a VoiceXML gateway running a voice browser, which speaks
to the caller, e.g. "The temperature is ...".]
A VoiceXML example
<?xml version="1.0"?>
<vxml application="tutorial.vxml" version="1.0">
  <form id="getPhoneNumber">
    <field name="PhoneNumber">
      <prompt>What's your phone number?</prompt>
      <grammar src="../grammars/phone.gram"
               type="application/x-jsgf" />
      <help>Please say your ten digit phone number.</help>
      <filled>
        <!-- The original slide's incomplete <if> is completed here as an
             illustrative validity check. -->
        <if cond="PhoneNumber &lt; 1000000">
          <prompt>That doesn't sound like a ten digit number.</prompt>
          <clear namelist="PhoneNumber"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
Speech Application Language
Tags (SALT)
Add 4 tags to HTML/XHTML/cHTML/WML:
<listen>, <prompt>, <dtmf>, <smex>
<html>
  <form action="nextpage.html">
    <input type="text" id="txtBoxCity" />
  </form>
  <listen id="reco1">
    <grammar src="cities.gram" />
    <bind targetElement="txtBoxCity" value="city" />
  </listen>
</html>
The Speech Advantage!
• Reduce costs
– Reduce labor expenses while still providing
customers an easy-to-use and natural way to
access information and services
• New revenue opportunities
– 24x7 high-quality customer care automation
– Access to information without a keyboard or
touch-tone
• Customer retention
– Stickiness of services
– Add personalization
Just the Beginning…
Market Opportunities
• Consumer communication
- Voice portals
- Desktop dictation
- Telematic applications
- Disabilities
• Call center automation
- Help Lines
- Customer care
• Voice-assisted E-commerce
- B2B
• Enterprise Communication
- Unified messaging
- Enterprise sales
A look into the future..
• 1990+ (e.g., VRCP): constrained speech, minimal data collection, manual
  design. Keyword spotting, handcrafted grammars, no dialogue.
• 1995+ (e.g., directory assistance, banking): constrained speech, moderate
  data collection, some automation. Medium-size ASR, handcrafted grammars,
  system initiative.
• 2000+ (e.g., airline reservation, call centers, e-commerce): spontaneous
  speech, extensive data collection, semi-automation. Large-size ASR,
  limited NLU, mixed-initiative.
• 2005+ (e.g., help desks, e-commerce; MATCH: Multimodal Access To City
  Help): spontaneous speech and pen input, fully automated systems.
  Unlimited ASR, deeper NLU, adaptive systems, multimodal, multilingual.
Spoken Dialog Interfaces Vs.
Touch-Tone IVR?
Reading Materials
• Spoken Language Processing, X. Huang, A. Acero, H.-W. Hon, Prentice Hall,
  2001.
• Fundamentals of Speech Recognition, L. Rabiner and B.-H. Juang, Prentice
  Hall, 1993.
• Automatic Speech and Speaker Recognition, C.-H. Lee, F. Soong, K. Paliwal,
  Kluwer Academic Publishers, 1996.
• Spoken Dialogues with Computers, R. De Mori (ed.), Academic Press, 1998.