Lecture 10: ASR: Sequence Recognition

EE E6820: Speech & Audio Processing & Recognition
Lecture 10:
ASR: Sequence Recognition
1 Signal template matching
2 Statistical sequence recognition
3 Acoustic modeling
4 The Hidden Markov Model (HMM)
Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e6820/
1 Signal template matching
• Framewise comparison of unknown word and stored templates:
[Figure: frame-by-frame distance matrix between the test input (horizontal axis, time /frames) and the stored reference templates ONE, TWO, THREE, FOUR, FIVE (vertical axis)]
- distance metric?
- comparison between templates?
- constraints?
Dynamic Time Warp (DTW)
• Find lowest-cost constrained path:
- matrix d(i,j) of distances between input frame f_i and reference frame r_j
- allowable predecessors & transition costs T_xy
[Figure: grid of input frames f_i (horizontal) against reference frames r_j (vertical); the lowest cost to reach (i,j) combines the local match cost d(i,j) with the best predecessor, including its transition cost T_10, T_01 or T_11]

$D(i,j) = d(i,j) + \min\{\, D(i{-}1,j) + T_{10},\;\; D(i,j{-}1) + T_{01},\;\; D(i{-}1,j{-}1) + T_{11} \,\}$
• Best path via traceback from final state (see the sketch below)
- have to store predecessors for (almost) every (i,j)
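A minimal numpy sketch of this recursion and traceback (the function name and the uniform transition-cost defaults are illustrative assumptions, not values from the lecture):

```python
import numpy as np

def dtw(dist, T10=1.0, T01=1.0, T11=0.0):
    """DTW sketch: dist[i, j] = d(i, j) between input frame i and reference
    frame j; T10/T01/T11 are the transition costs for the three allowed
    predecessor moves (illustrative defaults)."""
    I, J = dist.shape
    D = np.full((I, J), np.inf)
    pred = np.zeros((I, J, 2), dtype=int)          # best predecessor of each cell
    D[0, 0] = dist[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            cands = []                              # (cost so far + transition, predecessor)
            if i > 0:
                cands.append((D[i-1, j] + T10, (i-1, j)))
            if j > 0:
                cands.append((D[i, j-1] + T01, (i, j-1)))
            if i > 0 and j > 0:
                cands.append((D[i-1, j-1] + T11, (i-1, j-1)))
            best_cost, pred[i, j] = min(cands, key=lambda c: c[0])
            D[i, j] = dist[i, j] + best_cost
    # trace back the lowest-cost path from the final cell
    path, (i, j) = [(I - 1, J - 1)], (I - 1, J - 1)
    while (i, j) != (0, 0):
        i, j = pred[i, j]
        path.append((i, j))
    return D[-1, -1], path[::-1]
```

With any frame-to-frame distance matrix (e.g. Euclidean distances between feature vectors), `dtw(dist)` returns the total path cost and the warp path recovered by traceback.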
DTW-based recognition
• Reference templates for each possible word
• Isolated word:
- mark endpoints of input word
- calculate scores through each template (+ prune)
- choose best
• Continuous speech
- one matrix of template slices; special-case constraints at word ends
[Figure: input frames matched against a single matrix of concatenated reference templates ONE, TWO, THREE, FOUR]
DTW-based recognition (2)
+ Successfully handles timing variation
+ Able to recognize speech at reasonable cost
- Distance metric?
  - pseudo-Euclidean space?
- Warp penalties?
- How to choose templates?
  - several templates per word?
  - choose ‘most representative’?
  - align and average?
→ need a rigorous foundation...
Outline
1 Signal template matching
2 Statistical sequence recognition
- state-based modeling
3 Acoustic modeling
4 The Hidden Markov Model (HMM)
2 Statistical sequence recognition
• DTW limited because it’s hard to optimize
- interpretation of distance, transition costs?
• Need a theoretical foundation: Probability
• Formulate as MAP choice among models:
$M^* = \arg\max_{M_j} p(M_j \mid X, \Theta)$
- X = observed features
- M_j = word-sequence models
- Θ = all current parameters
Statistical formulation (2)
• Can rearrange via Bayes’ rule (& drop p(X)):
$M^* = \arg\max_{M_j} p(M_j \mid X, \Theta) = \arg\max_{M_j} p(X \mid M_j, \Theta_A)\, p(M_j \mid \Theta_L)$
- p(X | M_j) = likelihood of observations under the model
- p(M_j) = prior probability of the model
- Θ_A = acoustics-related model parameters
- Θ_L = language-related model parameters
• Questions:
- what form of model to use for p(X | M_j, Θ_A)?
- how to find Θ_A (training)?
- how to solve for M_j (decoding)? (see the sketch below)
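A tiny sketch of this MAP decision with made-up log scores (word names and numbers are purely illustrative):

```python
import numpy as np

# Hypothetical per-model scores: log p(X | M_j, Theta_A) from an acoustic model
# and priors p(M_j | Theta_L) from a language model (all values made up).
log_likelihoods = {"one": -205.1, "two": -198.7, "three": -211.4}
priors          = {"one": 0.40, "two": 0.35, "three": 0.25}

# MAP decision: argmax_j  log p(X | M_j) + log p(M_j); p(X) is a common factor
# and can be dropped, as noted above.
best = max(log_likelihoods,
           key=lambda m: log_likelihoods[m] + np.log(priors[m]))
print(best)   # -> "two" for these made-up numbers
```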
State-based modeling
• Assume discrete-state model for the speech:
- observations are divided up into time frames
- model → states → observations:
[Figure: model M_j generates a state sequence Q_k : q_1 q_2 q_3 q_4 q_5 q_6 ... over time, which in turn generates the observed feature vectors X_1^N : x_1 x_2 x_3 x_4 x_5 x_6 ...]
Probability of observations given model is:
p( X M j) =
∑
N
p ( X 1 Q k, M j ) ⋅ p ( Q k M j )
all Q k
- sum over all possible state sequences Qk
• How do observations depend on states?
  How do state sequences depend on model?
The speech recognition chain
• After classification, still have problem of classifying the sequences of frames:
[Figure: recognizer chain — sound → feature calculation → feature vectors → acoustic classifier (using network weights) → phone probabilities → HMM decoder (using word models & language model) → phone & word labeling]
• Questions
- what to use for the acoustic classifier?
- how to represent ‘model’ sequences?
- how to score matches?
Outline
1 Signal template matching
2 Statistical sequence recognition
3 Acoustic modeling
- defining targets
- neural networks & Gaussian models
4 The Hidden Markov Model (HMM)
3 Acoustic Modeling
• Goal: Convert features into probabilities of particular labels:
i.e. find $p(q^i_n \mid X_n)$ over some state set {q^i}
- conventional statistical classification problem
• Classifier construction is data-driven
- assume we can get examples of known good Xs for each of the q^i
- calculate model parameters by standard training scheme
• Various classifiers can be used
- GMMs model distribution under each state
- Neural Nets directly estimate posteriors
• Different classifiers have different properties
- features, labels limit ultimate performance
Defining classifier targets
• Choice of {q^i} can make a big difference
- must support recognition task
- must be a practical classification task
• Hand-labeling is one source...
- ‘experts’ mark spectrogram boundaries
• ...Forced alignment is another
- ‘best guess’ with existing classifiers, given words
• Result is targets for each training frame:
[Figure: feature vectors over time, with per-frame training targets (phone labels g, w, eh, n)]
Forced alignment
• Best labeling given existing classifier, constrained by known word sequence
[Figure: feature vectors → existing classifier → phone posterior probabilities; the known word sequence plus a dictionary give the phone string (ow th r iy ...); constrained alignment of the posteriors to that string yields per-frame training targets, which feed classifier training]
Gaussian Mixture Models vs. Neural Nets
• GMMs fit distribution of features under states:
- separate ‘likelihood’ model for each state q^i
$p(x \mid q^k) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)$
- match any distribution given enough data
• Neural nets estimate posteriors directly
$p(q^k \mid x) = F\!\left[\sum_j w_{jk} \cdot F\!\left[\sum_i w_{ij}\, x_i\right]\right]$
- parameters set to discriminate classes
• Posteriors & likelihoods related by Bayes’ rule (see the sketch below):
$p(q^k \mid x) = \frac{p(x \mid q^k) \cdot \Pr(q^k)}{\sum_j p(x \mid q^j) \cdot \Pr(q^j)}$
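A small numeric sketch of this relation with made-up numbers; converting net posteriors back to "scaled likelihoods" for use in an HMM is standard practice but is an addition here, not something stated on this slide:

```python
import numpy as np

likelihoods = np.array([2.5, 0.2, 0.1])    # p(x | q^k) for three states (made up)
priors      = np.array([0.5, 0.3, 0.2])    # Pr(q^k)

# Bayes' rule: p(q^k | x) = p(x | q^k) Pr(q^k) / sum_j p(x | q^j) Pr(q^j)
posteriors = likelihoods * priors / np.sum(likelihoods * priors)

# Going the other way, posteriors divided by the priors give likelihoods up to
# the common factor p(x):  p(x | q^k) / p(x) = p(q^k | x) / Pr(q^k)
scaled_likelihoods = posteriors / priors
print(posteriors, scaled_likelihoods)
```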
Outline
1 Signal template matching
2 Statistical sequence recognition
3 Acoustic modeling
4 The Hidden Markov Model (HMM)
- generative Markov models
- hidden Markov models
- model fit likelihood
- HMM examples
4 Markov models
• A (first order) Markov model is a finite-state system whose behavior depends only on the current state
• E.g. generative Markov model:
[Figure: state diagram over states S, A, B, C, E with the transition probabilities tabulated below]

Transition probabilities p(q_{n+1} | q_n) (row = q_n, column = q_{n+1}):

         S    A    B    C    E
    S    0    1    0    0    0
    A    0   .8   .1   .1    0
    B    0   .1   .8   .1    0
    C    0   .1   .1   .7   .1
    E    0    0    0    0    1

Example generated state sequence:
SAAAAAAAABBBBBBBBBCCCCBBBBBBCE
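A short sketch that samples state strings like the one above from this transition table (the sampling code and random seed are illustrative; the probabilities are from the slide):

```python
import numpy as np

states = ["S", "A", "B", "C", "E"]
P = np.array([[0, 1.0, 0,   0,   0  ],   # from S
              [0, 0.8, 0.1, 0.1, 0  ],   # from A
              [0, 0.1, 0.8, 0.1, 0  ],   # from B
              [0, 0.1, 0.1, 0.7, 0.1],   # from C
              [0, 0,   0,   0,   1.0]])  # from E (absorbing end state)

rng = np.random.default_rng(0)
q, seq = 0, ["S"]                          # start in S
while states[q] != "E":
    q = rng.choice(len(states), p=P[q])    # draw q_{n+1} ~ p(q_{n+1} | q_n)
    seq.append(states[q])
print("".join(seq))                        # a random S...E string like the one above
```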
Hidden Markov models
• Markov models where the state sequence Q = {q_n} is not directly observable (= ‘hidden’)
• But observations X do depend on Q:
- x_n is a random variable that depends on the current state: $p(x \mid q)$
[Figure: emission distributions p(x|q) for states q = A, B, C (over observation x, 0–4); a hidden state sequence AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC; and the resulting observation sequence x_n over time steps n = 0–30]
- can still tell something about state sequence from the observations...
(Generative) Markov models (2)
• An HMM is specified by:
- transition probabilities $p(q_n^j \mid q_{n-1}^i) \equiv a_{ij}$
- (initial state probabilities $p(q_1^i) \equiv \pi_i$)
- emission distributions $p(x \mid q^i) \equiv b_i(x)$
- states q^i
• Example: a left-to-right word model over states k, a, t, • (end):
- transition probabilities a_ij: enter at k with probability 1.0, self-loop with probability 0.9, advance to the next state with probability 0.1
- emission distributions b_i(x): one distribution p(x | q) per state
[Figure: trellis of the k–a–t–• states over time, the corresponding transition matrix, and the per-state emission distributions p(x | q)]
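A minimal container for this specification, using the k–a–t word model from the slide; the Gaussian emission parameters are invented here purely so the structure is complete (the slide only plots the b_i(x) curves), and the end state is made absorbing as a simplifying assumption:

```python
import numpy as np

hmm = {
    "states": ["k", "a", "t", "end"],
    # a_ij = p(q_n = j | q_{n-1} = i): self-loop 0.9, advance 0.1
    "A": np.array([[0.9, 0.1, 0.0, 0.0],
                   [0.0, 0.9, 0.1, 0.0],
                   [0.0, 0.0, 0.9, 0.1],
                   [0.0, 0.0, 0.0, 1.0]]),
    # pi_i = p(q_1 = i): always enter at "k"
    "pi": np.array([1.0, 0.0, 0.0, 0.0]),
    # b_i(x): here a 1-D Gaussian (mean, variance) per state -- invented numbers
    "b": [(1.0, 0.25), (2.5, 0.25), (3.5, 0.25), (0.0, 1.0)],
}

def emission_prob(hmm, i, x):
    """b_i(x) for the illustrative Gaussian emission model."""
    mu, var = hmm["b"][i]
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
```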
Markov models for speech
• Speech models M_j
- typically left-to-right HMMs (sequence constraint)
- observation & evolution are conditionally independent of the rest given the (hidden) state q_n
[Figure: left-to-right word model S → ae1 → ae2 → ae3 → E; hidden states q_1 ... q_5 emit the observations x_1 ... x_5]
- self-loops for time dilation
Markov models for sequence recognition
• Independence of observations:
- observation x_n depends only on the current state q_n
$p(X \mid Q) = p(x_1, x_2, \ldots, x_N \mid q_1, q_2, \ldots, q_N) = p(x_1 \mid q_1) \cdot p(x_2 \mid q_2) \cdots p(x_N \mid q_N) = \prod_{n=1}^{N} p(x_n \mid q_n) = \prod_{n=1}^{N} b_{q_n}(x_n)$
• Markov transitions:
- transition to the next state q_{n+1} depends only on q_n
$p(Q \mid M) = p(q_1, q_2, \ldots, q_N \mid M) = p(q_N \mid q_1 \ldots q_{N-1})\, p(q_{N-1} \mid q_1 \ldots q_{N-2}) \cdots p(q_2 \mid q_1)\, p(q_1)$
$= p(q_N \mid q_{N-1})\, p(q_{N-1} \mid q_{N-2}) \cdots p(q_2 \mid q_1)\, p(q_1) = p(q_1) \prod_{n=2}^{N} p(q_n \mid q_{n-1}) = \pi_{q_1} \prod_{n=2}^{N} a_{q_{n-1} q_n}$
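A small sketch evaluating these two factorizations for one particular state path (all numbers made up; b[i, n] holds the emission likelihood b_i(x_n) for frame n):

```python
import numpy as np

b  = np.array([[2.0, 0.5, 0.1],             # b_0(x_1), b_0(x_2), b_0(x_3)
               [0.1, 1.5, 2.0]])             # b_1(x_1), b_1(x_2), b_1(x_3)
A  = np.array([[0.7, 0.3],
               [0.0, 1.0]])
pi = np.array([1.0, 0.0])

Q = [0, 0, 1]                                # one state path q_1 q_2 q_3

p_X_given_Q = np.prod([b[q, n] for n, q in enumerate(Q)])                      # prod_n b_{q_n}(x_n)
p_Q_given_M = pi[Q[0]] * np.prod([A[Q[n-1], Q[n]] for n in range(1, len(Q))])  # pi_{q_1} prod_n a_{q_{n-1} q_n}
print(p_X_given_Q * p_Q_given_M)             # p(X, Q | M) for this single path
```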
Model fit calculation
• From ‘state-based modeling’:
$p(X \mid M_j) = \sum_{\text{all } Q_k} p(X_1^N \mid Q_k, M_j) \cdot p(Q_k \mid M_j)$
• For HMMs:
$p(X \mid Q) = \prod_{n=1}^{N} b_{q_n}(x_n)$
$p(Q \mid M) = \pi_{q_1} \cdot \prod_{n=2}^{N} a_{q_{n-1} q_n}$
• Hence, solve for M*:
- calculate p(X | M_j) for each available model, scale by prior p(M_j) → p(M_j | X)
• Sum over all Q_k ???
Summing over all paths
• Model M1: states S, A, B, E
Transition probabilities p(q_{n+1} | q_n) (row = from, column = to; ‘·’ = 0):

         A     B     E
    S   0.9   0.1    ·
    A   0.7   0.2   0.1
    B    ·    0.8   0.2
    E    ·     ·     1

• Observation likelihoods p(x_n | q) for the three observed frames:

         x1    x2    x3
    A   2.5   0.2   0.1
    B   0.1   2.2   2.3

• All possible 3-emission paths Q_k from S to E:

    Path         p(Q | M) = Πn p(qn|qn-1)       p(X | Q,M) = Πn p(xn|qn)       p(X,Q | M)
    S A A A E    .9 × .7 × .7 × .1 = 0.0441     2.5 × 0.2 × 0.1 =  0.05         0.0022
    S A A B E    .9 × .7 × .2 × .2 = 0.0252     2.5 × 0.2 × 2.3 =  1.15         0.0290
    S A B B E    .9 × .2 × .8 × .2 = 0.0288     2.5 × 2.2 × 2.3 = 12.65         0.3643
    S B B B E    .1 × .8 × .8 × .2 = 0.0128     0.1 × 2.2 × 2.3 =  0.506        0.0065
                               Σ = 0.1109                          Σ = p(X | M) = 0.4020
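A brute-force sketch that reproduces this table by enumerating the paths (the enumeration code itself is illustrative; all probabilities are from the slide):

```python
import itertools
import numpy as np

A = {("S", "A"): 0.9, ("S", "B"): 0.1,
     ("A", "A"): 0.7, ("A", "B"): 0.2, ("A", "E"): 0.1,
     ("B", "B"): 0.8, ("B", "E"): 0.2}
b = {"A": [2.5, 0.2, 0.1],                    # p(x_n | A) for n = 1..3
     "B": [0.1, 2.2, 2.3]}                    # p(x_n | B)

total = 0.0
for path in itertools.product("AB", repeat=3):             # emitting states only
    q = ["S", *path, "E"]
    p_Q = np.prod([A.get((q[i], q[i + 1]), 0.0) for i in range(4)])
    p_X = np.prod([b[s][n] for n, s in enumerate(path)])
    if p_Q > 0:
        print("".join(q), round(p_Q, 4), round(p_X, 4), round(p_Q * p_X, 4))
    total += p_Q * p_X
print("p(X | M1) =", round(total, 4))          # 0.402, matching the 0.4020 above
```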
The ‘forward recursion’
• Dynamic-programming-like technique to calculate the sum over all Q_k
• Define α_n(i) as the probability of getting to state q^i at time step n (by any path):
$\alpha_n(i) = p(x_1, x_2, \ldots, x_n, q_n = q^i) \equiv p(X_1^n, q_n^i)$
• Then α_{n+1}(j) can be calculated recursively:
$\alpha_{n+1}(j) = \Big[\sum_{i=1}^{S} \alpha_n(i)\, a_{ij}\Big] \cdot b_j(x_{n+1})$
[Figure: trellis of model states q^i against time steps n, with α_n(i) and α_n(i+1) feeding α_{n+1}(j) through a_{ij}, a_{i+1,j} and the emission b_j(x_{n+1})]
Forward recursion (2)
• Initialize $\alpha_1(i) = \pi_i \cdot b_i(x_1)$
• Then the total probability is $p(X_1^N \mid M) = \sum_{i=1}^{S} \alpha_N(i)$
→ Practical way to solve for p(X | M_j) and hence perform recognition (see the sketch below)
[Figure: observations X are scored against each model to give p(X | M1)·p(M1), p(X | M2)·p(M2), ...; choose best]
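A minimal numpy sketch of the forward recursion; the exit_p argument is an addition here to fold in the non-emitting end state of the earlier worked example, and real implementations scale the alphas or work in the log domain to avoid underflow:

```python
import numpy as np

def forward(pi, A, B, exit_p=None):
    """Forward recursion sketch (no scaling/logs, for clarity only).
    pi: (S,) initial state probabilities; A: (S, S) transitions a_ij;
    B:  (S, N) emission likelihoods b_i(x_n) for the N observed frames;
    exit_p: optional (S,) probabilities of leaving to a final state."""
    S, N = B.shape
    alpha = pi * B[:, 0]                       # alpha_1(i) = pi_i * b_i(x_1)
    for n in range(1, N):
        alpha = (alpha @ A) * B[:, n]          # [sum_i alpha_n(i) a_ij] * b_j(x_{n+1})
    return alpha @ exit_p if exit_p is not None else alpha.sum()

# Check against the worked example (non-emitting S and E folded into the
# entry distribution and the exit probabilities of the emitting states A, B):
p = forward(pi=np.array([0.9, 0.1]),
            A=np.array([[0.7, 0.2], [0.0, 0.8]]),
            B=np.array([[2.5, 0.2, 0.1], [0.1, 2.2, 2.3]]),
            exit_p=np.array([0.1, 0.2]))
print(p)                                       # ~0.402, the p(X | M1) computed earlier
```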
Optimal path
• May be interested in actual q_n assignments
- which state was ‘active’ at each time frame
- e.g. phone labelling (for training?)
• Total probability is over all paths...
• ... but can also solve for single best path = “Viterbi” state sequence
• Probability along best path to state $q_{n+1}^j$ (see the sketch below):
$\alpha^*_{n+1}(j) = \max_i \{\alpha^*_n(i)\, a_{ij}\} \cdot b_j(x_{n+1})$
- backtrack from final state to get best path
- final probability is product only (no sum)
→ log-domain calculation just summation
• Total probability often dominated by best path:
$p(X, Q^* \mid M) \approx p(X \mid M)$
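A log-domain Viterbi sketch consistent with the forward() sketch above; it again leaves out the non-emitting end state, so the reported score excludes the final exit transition:

```python
import numpy as np

def viterbi(pi, A, B):
    """Best-path (Viterbi) sketch in the log domain, so products become sums.
    Same pi / A / B conventions as the forward() sketch."""
    S, N = B.shape
    logA = np.log(A + 1e-300)                        # avoid log(0)
    delta = np.log(pi + 1e-300) + np.log(B[:, 0] + 1e-300)
    back = np.zeros((N, S), dtype=int)
    for n in range(1, N):
        scores = delta[:, None] + logA               # log alpha*_n(i) + log a_ij
        back[n] = scores.argmax(axis=0)              # best predecessor of each state j
        delta = scores.max(axis=0) + np.log(B[:, n] + 1e-300)
    q = [int(delta.argmax())]                        # best final state
    for n in range(N - 1, 0, -1):                    # traceback
        q.append(int(back[n, q[-1]]))
    return q[::-1], float(delta.max())

path, logp = viterbi(np.array([0.9, 0.1]),
                     np.array([[0.7, 0.2], [0.0, 0.8]]),
                     np.array([[2.5, 0.2, 0.1], [0.1, 2.2, 2.3]]))
print(path, logp)   # [0, 1, 1] = A B B, the dominant path in the earlier example
```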
Interpreting the Viterbi path
• Viterbi path assigns each x_n to a state q^i
- performing classification based on b_i(x)
- ... at the same time as applying transition constraints a_ij
[Figure: observation sequence x_n over time steps 0–30 with the inferred classification; Viterbi labels: AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC]
• Can be used for segmentation
- train an HMM with ‘garbage’ and ‘target’ states
- decode on new data to find ‘targets’, boundaries
• Can use for (heuristic) training
- e.g. train classifiers based on labels...
Recognition with HMMs
• Isolated word
- choose best p(M | X) ∝ p(X | M)·p(M) (see the sketch below)
[Figure: the input is scored against each word model — Model M1 = w-ah-n gives p(X | M1)·p(M1) = ..., Model M2 = t-uw gives p(X | M2)·p(M2) = ..., Model M3 = th-r-iy gives p(X | M3)·p(M3) = ...]
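A few lines tying this decision rule to the forward() sketch given earlier; word_models and priors are hypothetical containers (per-model pi, A and precomputed per-frame emission likelihoods B), not structures from the lecture:

```python
# Isolated-word recognition sketch, reusing forward() from the earlier sketch:
# score the observations under every word model and choose the MAP word.
def recognize(word_models, priors):
    scores = {w: forward(pi, A, B) * priors[w]        # p(X | M_w) * p(M_w)
              for w, (pi, A, B) in word_models.items()}
    return max(scores, key=scores.get)
```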
• Continuous speech
- Viterbi decoding of one large HMM gives words
[Figure: composite HMM — the input enters a silence state sil and branches, with priors p(M1), p(M2), p(M3), into the word models w-ah-n, t-uw and th-r-iy]
HMM example:
Different state sequences
[Figure: two word models, M1 and M2, built from the states K, O, A, T arranged in different sequences between S and E, each with self-loop probabilities around 0.8–0.9 and exit probabilities around 0.1–0.2; below, the emission distributions p(x | K), p(x | O), p(x | A), p(x | T) over observation x = 0–4]
Model inference:
Emission probabilities
[Figure: an observation sequence x_n (≈ 18 time steps) decoded by both models, showing each model’s Viterbi state alignment, per-step log transition probabilities and log observation likelihoods]

Model M1:  log p(X | M) = -32.1,  log p(X, Q* | M) = -33.5,  log p(Q* | M) = -7.5,  log p(X | Q*, M) = -26.0
Model M2:  log p(X | M) = -47.0,  log p(X, Q* | M) = -47.5,  log p(Q* | M) = -8.3,  log p(X | Q*, M) = -39.2
Model inference:
Transition probabilities
[Figure: model variants M'1 and M'2 with the same states but altered transition probabilities (values such as 0.9, 0.8, 0.18, 0.15, 0.05, 0.02), decoded on the same observation sequence; the state alignment and log observation likelihood log p(X | Q*, M) = -26.0 are shown for M'1]

Model M'1:  log p(X | M) = -32.2,  log p(X, Q* | M) = -33.6,  log p(Q* | M) = -7.6
Model M'2:  log p(X | M) = -33.5,  log p(X, Q* | M) = -34.9,  log p(Q* | M) = -8.9
Validity of HMM assumptions
• Key assumption is conditional independence:
given q^i, future evolution & observation distribution are independent of previous events
- duration behavior: self-loops imply exponential distribution (see the sketch below)
[Figure: p(N = n) against n — with self-loop probability γ, the duration probabilities fall off as (1−γ), γ(1−γ), γ²(1−γ), ...]
- independence of successive x_n:
$p(X) = \prod_n p(x_n \mid q^i)$ ?
[Figure: successive observations x_n drawn independently from the same p(x_n | q^i)]
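A quick numeric sketch of the duration claim (the self-loop probability γ = 0.9 is an arbitrary choice); note that the most probable duration under this model is always a single frame, which is rarely true of real phones:

```python
import numpy as np

# Staying in a state for exactly n frames, with self-loop probability gamma,
# has probability gamma**(n-1) * (1 - gamma): a geometrically decaying curve.
gamma = 0.9
n = np.arange(1, 201)
p_dur = gamma ** (n - 1) * (1 - gamma)
print(p_dur[:4])            # 0.1, 0.09, 0.081, 0.0729 -- monotonically decreasing
print((n * p_dur).sum())    # mean duration close to 1 / (1 - gamma) = 10 frames
```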
Recap: Recognizer Structure
[Figure: recognizer chain — sound → feature calculation → feature vectors → acoustic classifier (using network weights) → phone probabilities → HMM decoder (using word models & language model) → phone & word labeling]
• Know how to execute each state
• .. training HMMs?
• .. language/word models
Summary
• Speech is modeled as a sequence of features
- need temporal aspect to recognition
- best time-alignment of templates = DTW
• Hidden Markov models are rigorous solution
- self-loops allow temporal dilation
- exact, efficient likelihood calculations