Lecture 9: Speech Recognition

EE E6820: Speech & Audio Processing & Recognition
Dan Ellis <dpwe@ee.columbia.edu>
Michael Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
April 7, 2009
Outline

1. Recognizing speech
2. Feature calculation
3. Sequence recognition
4. Large vocabulary, continuous speech recognition (LVCSR)
Recognizing speech
“So, I thought about that and I think it’s still possible”
[Spectrogram of the utterance, 0-4000 Hz, 0-3 s]
What kind of information might we want from the speech signal?
- words
- phrasing, 'speech acts' (prosody)
- mood / emotion
- speaker identity

What kind of processing do we need to get at that information?
- time scale of feature extraction
- signal aspects to capture in features
- signal aspects to exclude from features
Speech recognition as transcription

Transcription = "speech to text"
- find a word string to match the utterance

Gives a neat objective measure: word error rate (WER)
- can be a sensitive measure of performance

Three kinds of errors: substitutions (S), deletions (D), insertions (I):

  Reference:   THE  CAT  SAT  ON   THE  –  MAT
  Recognized:  –    CAT  SAT  AN   THE  A  MAT
               Del            Sub       Ins

  WER = (S + D + I) / N,  where N = number of reference words
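
Since WER is just a word-level edit distance, it can be computed with the standard dynamic-programming alignment. A minimal sketch in Python (function and variable names are ours, not part of the lecture):

import_note = None  # pure standard library

def wer(ref, hyp):
    """Word error rate: (S + D + I) / N via edit-distance alignment."""
    ref, hyp = ref.split(), hyp.split()
    # D[i][j] = min edits turning the first i ref words into the first j hyp words
    D = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        D[i][0] = i                     # all deletions
    for j in range(1, len(hyp) + 1):
        D[0][j] = j                     # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = D[i-1][j-1] + (ref[i-1] != hyp[j-1])   # match or substitution
            D[i][j] = min(sub, D[i-1][j] + 1, D[i][j-1] + 1)
    return D[-1][-1] / len(ref)         # N = number of reference words

print(wer("THE CAT SAT ON THE MAT", "CAT SAT AN THE A MAT"))   # (1+1+1)/6 = 0.5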
Problems: Within-speaker variability

Timing variation
- word duration varies enormously
- fast speech 'reduces' vowels

[Spectrogram with aligned word and phone labels for "SO I THOUGHT ABOUT THAT AND I THINK IT'S STILL POSSIBLE", 0-4000 Hz, 0-3 s]

Speaking style variation
- careful/casual articulation
- soft/loud speech

Contextual effects
- speech sounds vary with context, role: "How do you do?"
Problems: Between-speaker variability

Accent variation
- regional / mother tongue

Voice quality variation
- gender, age, huskiness, nasality

Individual characteristics
- mannerisms, speed, prosody

[Spectrograms of the same utterance from two speakers (mbma0, fjdm2), 0-8000 Hz, 0-2.5 s]
Problems: Environment variability

Background noise
- fans, cars, doors, papers

Reverberation
- 'boxiness' in recordings

Microphone/channel
- huge effect on relative spectral gain

[Spectrograms of the same speech from a close mic and a tabletop mic, 0-4000 Hz, 0-1.4 s]
How to recognize speech?

Cross-correlate templates?
- waveform?
- spectrogram?
- time-warp problems

Match short segments & handle time-warp later
- model with slices of ~10 ms
- pseudo-stationary model of words:

[Spectrogram with aligned state labels: sil g w eh n sil, 0-4000 Hz, 0-0.45 s]

- other sources of variation...
Probabilistic formulation

Probability that a segment label is correct
- gives the standard form of speech recognizers

Feature calculation: s[n] → X_m  (m = n/H for hop size H)
- transforms the signal into an easily-classified domain

Acoustic classifier: p(q^i | X)
- calculates probabilities of each mutually-exclusive state q^i

'Finite state acceptor' (i.e. HMM):

  Q* = argmax_{q_0, q_1, ..., q_L} p(q_0, q_1, ..., q_L | X_0, X_1, ..., X_L)

- MAP match of an allowable state sequence (q_0 = "ay", q_1, ...) to the frame-by-frame probabilities
Standard speech recognizer structure

Fundamental equation of speech recognition:

  Q* = argmax_Q p(Q | X, Θ)
     = argmax_Q p(X | Q, Θ) p(Q | Θ)

- X = acoustic features
- p(X | Q, Θ) = acoustic model
- p(Q | Θ) = language model
- argmax_Q = search over sequences

Questions:
- what are the best features?
- how do we model them?
- how do we find/match the state sequence?
Feature Calculation

Goal: find a representational space most suitable for classification
- waveform: voluminous, redundant, variable
- spectrogram: better, still quite variable
- ...?

Pattern recognition: the representation is an upper bound on performance
- maybe we should use the waveform...
- or maybe the representation can do all the work

Feature calculation is intimately bound to the classifier
- pragmatic strengths and weaknesses

Features develop by slow evolution
- current choices are more historical than principled
Features (1): Spectrogram

Plain STFT as features, e.g.

  X_m[k] = S[mH, k] = Σ_n s[n + mH] w[n] e^{-j2πkn/N}

[Spectrograms of the same phrase from two speakers, 0-8000 Hz, with one feature-vector slice highlighted]

Similarities between corresponding segments
- but still large differences
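
As a concrete sketch of the slice computation above (the Hann window and the values of N and H are typical but assumed choices):

import numpy as np

def stft_features(s, N=512, H=256):
    """Log-magnitude STFT: one feature vector X_m per hop of H samples."""
    w = np.hanning(N)
    n_frames = 1 + (len(s) - N) // H
    X = np.empty((n_frames, N // 2 + 1))
    for m in range(n_frames):
        frame = s[m * H : m * H + N] * w            # s[n + mH] w[n]
        X[m] = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)  # floor avoids log(0)
    return X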
Features (2): Cepstrum

Idea: decorrelate and summarize the spectral slices:

  X_m[ℓ] = IDFT{ log |S[mH, k]| }

- good for Gaussian models
- greatly reduces feature dimension

[Spectrum and cepstrum features for male and female examples, 0-2.5 s]
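
In code, the cepstral step is one inverse FFT per frame plus truncation. A sketch that takes the log-magnitude frames from the STFT sketch above (the 13-coefficient cut is a conventional, assumed choice):

import numpy as np

def cepstrum(log_mag, n_keep=13):
    """X_m[l] = IDFT{log|S[mH, k]|}, truncated to the low quefrencies."""
    c = np.fft.irfft(log_mag, axis=1)   # treats each row as a half-spectrum
    return c[:, :n_keep]                # keep slow spectral-envelope terms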
Features (3): Frequency axis warp

A linear frequency axis gives equal 'space' to 0-1 kHz and 3-4 kHz
- but their perceptual importance is very different

Warp the frequency axis closer to a perceptual axis
- mel, Bark, constant-Q, ...

  X[c] = Σ_{k=ℓ_c}^{u_c} |S[k]|²

[Linear spectrum vs. warped 'audspec' features for male and female examples]
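
A sketch of the band-summation formula, with edges ℓ_c, u_c placed uniformly on the mel scale. Rectangular bands are used for clarity; practical front ends use overlapping triangular weights, and the sample rate and band count here are assumptions:

import numpy as np

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_spectrum(power, sr=16000, n_bands=15):
    """Sum |S[k]|^2 over bands equally spaced on the mel axis."""
    n_bins = power.shape[1]
    edges_hz = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_bands + 1))
    edges = np.round(edges_hz / (sr / 2) * (n_bins - 1)).astype(int)
    return np.stack([power[:, edges[c]:edges[c + 1] + 1].sum(axis=1)
                     for c in range(n_bands)], axis=1)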
Features (4): Spectral smoothing

Generalizing across different speakers is helped by smoothing (i.e. blurring) the spectrum

Truncated cepstrum is one way:
- MMSE approximation to log |S[k]|

LPC modeling is a little different:
- MMSE approximation to |S[k]| → prefers detail at spectral peaks

[Spectral slice (level/dB vs. frequency channel) comparing audspec and PLP; audspec vs. PLP-smoothed features over time for the male example]
Features (5): Normalization along time

Idea: capture feature variations, not absolute level

Hence: calculate the average level along time and subtract it:

  Ŷ[n, k] = X̂[n, k] − mean_n { X̂[n, k] }

This factors out a fixed channel frequency response:

  x[n] = h_c[n] ∗ s[n]
  X̂[n, k] = log |X[n, k]| = log |H_c[k]| + log |S[n, k]|

[PLP features before and after mean normalization, male and female examples]
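
The normalization itself is one line over a (frames × channels) feature array; a sketch:

import numpy as np

def cmn(X):
    """Subtract the per-channel time average: any fixed log|Hc[k]| cancels."""
    return X - X.mean(axis=0, keepdims=True)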
Delta features

Want each segment to have 'static' feature values
- but some segments are intrinsically dynamic!
- → calculate their derivatives: maybe steadier?

Append dX/dt (and d²X/dt²) to the feature vectors

[Mean- and variance-normalized PLP features with deltas and double-deltas appended]

Relates to onset sensitivity in humans?
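
Deltas are usually computed as a least-squares slope over a few neighboring frames rather than a raw frame difference. A sketch (the ±2-frame window is a common, assumed choice):

import numpy as np

def deltas(X, w=2):
    """Regression-based dX/dt over +/- w frames (edge frames replicated)."""
    Xp = np.pad(X, ((w, w), (0, 0)), mode='edge')
    num = sum(t * (Xp[w + t : len(X) + w + t] - Xp[w - t : len(X) + w - t])
              for t in range(1, w + 1))
    return num / (2 * sum(t * t for t in range(1, w + 1)))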
Overall feature calculation

MFCCs and/or RASTA-PLP:

[Block diagram: Sound → FFT |X[k]| spectra → frequency warp (mel scale for MFCC, Bark scale for RASTA-PLP) → audspec → log|X[k]| → then either: IFFT, truncate, subtract mean → CMN MFCC features; or: RASTA band-pass, LPC smoothing, cepstral recursion → RASTA-PLP cepstral features]

Key attributes:
- spectral, auditory scale (warped filterbank)
- decorrelation (cepstra)
- smoothed (spectral) detail (truncation / LPC)
- normalization of levels / onsets (CMN / RASTA band-pass)
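
Chaining the sketches above gives an MFCC-like front end. The DCT-II here plays the role of the IDFT in the cepstrum step, as MFCC implementations conventionally do; this composition reuses our hypothetical helpers (stft_features, mel_spectrum, cmn, deltas), not the lecture's exact pipeline:

import numpy as np

def dct2(x, n_keep=13):
    """Unnormalized DCT-II along axis 1, truncated to n_keep coefficients."""
    n = x.shape[1]
    k = np.arange(n_keep)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return x @ basis.T

def mfcc_like(s, sr=16000):
    logmag = stft_features(s)                        # framed log|S|
    power = np.exp(2 * logmag)                       # |S|^2 for band summing
    bands = np.log(mel_spectrum(power, sr) + 1e-10)  # warped log spectrum
    c = cmn(dct2(bands))                             # decorrelate + normalize
    return np.hstack([c, deltas(c), deltas(deltas(c))])  # append dynamics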
Features summary

[Feature stages for male and female examples: spectrum → audspec → rasta → deltas, 0-1.5 s]

- normalize the same phones
- contrast different phones
Sequence recognition: Dynamic Time Warp (DTW)

Framewise comparison with stored templates:

[Distance matrix between test input frames and concatenated reference templates ONE TWO THREE FOUR FIVE, with the lowest-cost alignment path]

- distance metric?
- comparison across templates?
Dynamic Time Warp (2)

Find the lowest-cost constrained path:
- matrix d(i, j) of distances between input frame f_i and reference frame r_j
- allowable predecessors and transition costs T_xy

  D(i, j) = d(i, j) + min { D(i−1, j) + T_10,
                            D(i, j−1) + T_01,
                            D(i−1, j−1) + T_11 }

(D(i, j) is the lowest cost to reach (i, j); d(i, j) is the local match cost; the min picks the best predecessor, including its transition cost.)

Best path via traceback from the final state
- store the best predecessor for each (i, j)
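
The recursion translates directly into code. A minimal sketch with Euclidean local distance and unit transition costs, both of which the lecture leaves open:

import numpy as np

def dtw(test, ref, T10=1.0, T01=1.0, T11=1.0):
    """Lowest-cost alignment of two (frames x dims) feature sequences."""
    d = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=2)
    I, J = d.shape
    D = np.full((I, J), np.inf)
    D[0, 0] = d[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            best = np.inf                      # best predecessor + transition
            if i > 0:
                best = min(best, D[i-1, j] + T10)
            if j > 0:
                best = min(best, D[i, j-1] + T01)
            if i > 0 and j > 0:
                best = min(best, D[i-1, j-1] + T11)
            D[i, j] = d[i, j] + best
    return D[-1, -1]   # for the path itself, also store argmins and trace back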
DTW-based recognition

Reference templates for each possible word

For isolated words:
- mark endpoints of the input word
- calculate scores through each template (+ prune)

[Alignment of input frames against reference templates ONE TWO THREE FOUR]

- continuous speech: link together word ends

Successfully handles timing variation
- recognize speech at reasonable cost
Statistical sequence recognition

DTW is limited because it is hard to optimize
- learning from multiple observations
- interpretation of distance, transition costs?

Need a theoretical foundation: probability

Formulate recognition as a MAP choice among word sequences:

  Q* = argmax_Q p(Q | X, Θ)

- X = observed features
- Q = word sequences
- Θ = all current parameters
State-based modeling

Assume a discrete-state model for the speech:
- observations are divided up into time frames
- model → states → observations

[Model M_j generates state sequence Q_k: q_1 q_2 q_3 q_4 ... over time, which generates the observed feature vectors X_1^N: x_1 x_2 x_3 x_4 ...]

Probability of observations given the model:

  p(X | Θ) = Σ_{all Q} p(X_1^N | Q, Θ) p(Q | Θ)

- sum over all possible state sequences Q

How do observations X_1^N depend on states Q?
How do state sequences Q depend on the model Θ?
HMM review

An HMM is specified by parameters Θ:
- states q^i
- transition probabilities a_ij
- emission distributions b_i(x)
- (+ initial state probabilities π_i)

e.g. a left-to-right model over states k, a, t, • (exit), with self-loops:

  π = (1.0, 0.0, 0.0, 0.0)

  a_ij      to k   to a   to t   to •
  from k    0.9    0.1    0.0    0.0
  from a    0.0    0.9    0.1    0.0
  from t    0.0    0.0    0.9    0.1

[each state carries an emission density p(x | q)]

  a_ij ≡ p(q_n^j | q_{n−1}^i)     b_i(x) ≡ p(x | q^i)     π_i ≡ p(q_1^i)
HMM summary (1)

HMMs are a generative model: recognition is inference of p(Q | X)

During generation, the behavior of the model depends only on the current state q_n:
- transition probabilities p(q_{n+1} | q_n) = a_ij
- observation distributions p(x_n | q_n) = b_i(x)

Given states Q = {q_1, q_2, ..., q_N} and observations X = X_1^N = {x_1, x_2, ..., x_N}, the Markov assumption gives

  p(X, Q | Θ) = Π_n p(x_n | q_n) p(q_n | q_{n−1})
HMM summary (2)

Calculate p(X | Θ) via the forward recursion:

  p(X_1^n, q_n^j) = α_n(j) = [ Σ_{i=1}^S α_{n−1}(i) a_ij ] b_j(x_n)

Viterbi (best path) approximation:

  α*_n(j) = max_i { α*_{n−1}(i) a_ij } b_j(x_n)

- then backtrace...  Q* = argmax_Q p(X, Q | Θ)

Pictorially: X is observed; the best model M* and state sequence Q* = {q_1, q_2, ..., q_n} are inferred from the assumed, hidden models.
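
Both recursions are a few lines of array code. A minimal sketch, assuming a transition matrix A[i, j] = a_ij, initial probabilities pi, and precomputed emission likelihoods B[n, j] = b_j(x_n) (all names our own):

import numpy as np

def forward(pi, A, B):
    """p(X | Theta) by the forward recursion (unscaled; real systems
    rescale alpha each frame or work in the log domain)."""
    N, S = B.shape
    alpha = pi * B[0]                       # alpha_1(j)
    for n in range(1, N):
        alpha = (alpha @ A) * B[n]          # sum over predecessor states i
    return alpha.sum()

def viterbi(pi, A, B):
    """Best single state path (max in place of sum), with backtrace."""
    eps = 1e-300                            # floor to avoid log(0)
    N, S = B.shape
    logA = np.log(A + eps)
    delta = np.log(pi + eps) + np.log(B[0] + eps)
    back = np.zeros((N, S), dtype=int)
    for n in range(1, N):
        scores = delta[:, None] + logA      # scores[i, j]: state i -> j
        back[n] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[n] + eps)
    path = [int(delta.argmax())]
    for n in range(N - 1, 0, -1):           # trace best predecessors back
        path.append(int(back[n, path[-1]]))
    return path[::-1]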
Recognition with HMMs

Isolated word
- choose the best p(M | X) ∝ p(X | M) p(M)

[Input scored against word models: M1 = w-ah-n ("one"), M2 = t-uw ("two"), M3 = th-r-iy ("three"); compute p(X | Mj) · p(Mj) for each]

Continuous speech
- Viterbi decoding of one large HMM gives the words

[Word models joined in a loop through silence: sil → {w-ah-n, t-uw, th-r-iy} with priors p(M1), p(M2), p(M3)]
Training HMMs

The probabilistic foundation allows us to train HMMs to 'fit' training data
- i.e. estimate a_ij, b_i(x) given data
- better than DTW...

Algorithms to improve p(Θ | X) are key to the success of HMMs
- maximum likelihood of models...

State alignments Q for the training examples are generally unknown
- ...else estimating parameters would be easy

Viterbi training
- 'forced alignment'
- choose 'best' labels (heuristic)

EM training
- 'fuzzy labels' (guaranteed local convergence)
Overall training procedure

[Labelled training data ("two one", "five", "four three") is aligned against the corresponding concatenations of word models (w-ah-n, t-uw, th-r-iy, f-ao-r, f-ay-v); fit models to data, re-estimate model parameters, repeat until convergence]
Language models

Recall the fundamental equation of speech recognition:

  Q* = argmax_Q p(Q | X, Θ)
     = argmax_Q p(X | Q, Θ_A) p(Q | Θ_L)

So far we have looked at p(X | Q, Θ_A). What about p(Q | Θ_L)?
- Q is a particular word sequence
- Θ_L are parameters related to the language

Two components:
- link state sequences to words: p(Q | w_i)
- priors on word sequences: p(w_i | M_j)
HMM Hierarchy

HMMs support composition
- can handle time dilation, pronunciation, and grammar all within the same framework

  p(q | M) = p(q, φ, w | M)
           = p(q | φ) · p(φ | w) · p(w_n | w_1^{n−1}, M)

[Three-level composition: states ae1 ae2 ae3 make up phone ae; phones k-ae-t make up CAT; words combine under the grammar, e.g. THE {CAT, DOG} {SAT, ATE}]
Pronunciation models

Define states within each word: p(Q | w_i)

Can have unique states for each word ('whole-word' modeling), or...

Share (tie) subword units between words to reflect the underlying phonology
- more training examples for each unit
- generalizes to unseen words
- (or can do it automatically...)

Start e.g. from a pronunciation dictionary:

  ZERO(0.5)  z iy r ow
  ZERO(0.5)  z ih r ow
  ONE(1.0)   w ah n
  TWO(1.0)   tcl t uw
Learning pronunciations

A 'phone recognizer' transcribes the training data as phones
- align to 'canonical' pronunciations:

  Baseform phoneme string:  f ay v  y iy r  ow l d
  Surface phone string:     f ah ay v  y uh r  ow l

- infer modification rules, e.g. 'd deletion':  d → ∅ / l __  (p = 0.9)
- predict other pronunciation variants

Generate pronunciation variants; use forced alignment to find their weights
Grammar

Account for the different likelihoods of different words and word sequences: p(w_i | M_j)

'True' probabilities are very complex for LVCSR
- need parses, but speech is often agrammatic

→ Use n-grams:

  p(w_n | w_1^{n−1}) ≈ p(w_n | w_{n−K}, ..., w_{n−1})

e.g. n-gram models of Shakespeare:
  n=1: To him swallowed confess hear both. Which. Of save on...
  n=2: What means, sir. I confess she? then all sorts, he is trim,...
  n=3: Sweet prince, Falstaff shall die. Harry of Monmouth's grave...
  n=4: King Henry. What! I will go seek the traitor Gloucester....

Big win in recognizer WER
- raw recognition results are often highly ambiguous
- the grammar guides the search toward 'reasonable' solutions
Smoothing LVCSR grammars

n-grams (n = 3 or 4) are estimated from large text corpora
- 100M+ words
- but: not like spoken language

A 100,000-word vocabulary → 10^15 possible trigrams!
- never see enough examples
- unobserved trigrams should NOT have probability 0!

Back off to bigrams, unigrams
- p(w_n) as an approximation to p(w_n | w_{n−1}), etc.
- interpolate 1-gram, 2-gram, 3-gram with learned weights?

Lots of ideas, e.g. category grammars
- p(PLACE | "went", "to") · p(w_n | PLACE)
- how to define the categories?
- how to tag words in the training corpus?
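
To make the backoff idea concrete, here is a toy bigram model with a crude smoothed-unigram fallback. This is a simplified 'stupid backoff' style scheme, not a proper Katz or interpolated estimate; all names and the backoff weight are our own:

from collections import Counter

def train(corpus):
    """Count unigrams and bigrams in a whitespace-tokenized corpus."""
    w = corpus.split()
    return Counter(w), Counter(zip(w, w[1:]))

def p_bigram(w, prev, uni, bi, alpha=0.4):
    if bi[(prev, w)] > 0:
        return bi[(prev, w)] / uni[prev]            # seen bigram: relative freq
    n = sum(uni.values())
    return alpha * (uni[w] + 1) / (n + len(uni))    # back off: add-one unigram

uni, bi = train("the cat sat on the mat the cat ate")
print(p_bigram("cat", "the", uni, bi))   # seen bigram: 2/3
print(p_bigram("dog", "the", uni, bi))   # unseen: small but nonzero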
Decoding

How to find the MAP word sequence?

States, pronunciations, and words define one big HMM
- with 100,000+ individual states for LVCSR!

→ Exploit the hierarchic structure
- phone states are independent of the word
- the next word is (semi-)independent of the word history

[Lexical tree sharing phone prefixes: root → d → uw (DO); root → d → iy → k → ow → oy (DECOY) or → d (DECODE), then + s/z (DECODES) or + axr (DECODER); root → b → ...]
Decoder pruning

Searching 'all possible word sequences'?
- need to restrict the search to the most promising ones: beam search
- sort by estimates of total probability = Pr(so far) + a lower-bound estimate of the remainder
- trade search errors for speed

Start-synchronous algorithm:
- extract the top hypothesis from the queue:
    [P_n, {w_1, ..., w_k}, n]
  (probability so far, words, next time frame)
- find plausible words {w^i} starting at time n → new hypotheses:
    [P_n · p(X_n^{n+N−1} | w^i) · p(w^i | w_k ...),  {w_1, ..., w_k, w^i},  n + N]
- discard if too unlikely, or if the queue is too long
- else re-insert into the queue and repeat
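
A toy version of this loop, where the scoring callback, beam width, and queue cap are all our own stand-ins for the real acoustic and language models:

import heapq

def beam_decode(score_word, vocab, n_frames, beam=5, max_queue=1000):
    """Start-synchronous beam search sketch.
    score_word(w, n) -> (log-prob of word w spanning frames n..n_end, n_end);
    it is assumed to advance time (n_end > n)."""
    queue = [(0.0, (), 0)]            # (-log prob so far, word history, next frame)
    best = (float('-inf'), ())
    while queue:
        neg, hist, n = heapq.heappop(queue)
        if n >= n_frames:             # hypothesis covers the whole input
            if -neg > best[0]:
                best = (-neg, hist)
            continue
        ext = []
        for w in vocab:               # find plausible words starting at frame n
            lp, n_end = score_word(w, n)
            ext.append((neg - lp, hist + (w,), n_end))
        for h in sorted(ext)[:beam]:  # keep only the most promising extensions
            heapq.heappush(queue, h)
        if len(queue) > max_queue:    # queue too long: discard the least likely
            queue = sorted(queue)[:max_queue]
    return best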
Summary

The speech signal is highly variable
- need models that absorb variability
- hide what we can with robust features

Speech is modeled as a sequence of features
- need a temporal aspect to recognition
- best time-alignment of templates = DTW

Hidden Markov models are the rigorous solution
- self-loops allow temporal dilation
- exact, efficient likelihood calculations

Language modeling captures larger structure
- pronunciation, word sequences
- fits directly into the HMM state structure
- need to 'prune' the search space in decoding

Parting thought: forward-backward trains models to generate; can we train them to discriminate?