Hidden Markov Models for Automatic Speech Recognition

Dr. Mike Johnson
Marquette University, EECE Dept.
Page 1
Overview
Intro: The problem with sequential data
Markov chains
Hidden Markov Models
Key HMM algorithms:
  Evaluation
  Alignment
  Training / parameter estimation
Examples / applications
Page 2
Big Picture
[Figure: "View of Statistical Models" diagram, with labels "Basic Gaussian" and "HMMs"]
Nonstationary sequential data
[Figure: two waveform plots, "Original speed" and "Same word, different tempo"; vertical axis from -1 to 1, horizontal axis from 0 to 1.6]
Page 4
Historical Method: Dynamic Time Warping
DTW is a dynamic path search that aligns the input against a stored template
The best path can be found using Dynamic Programming
[Figure: warping grid, with Input (0 to 7) on the horizontal axis and Template (0 to 7) on the vertical axis]
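The dynamic-programming search over the warping grid can be sketched as follows. This is a minimal illustration, not the method from the slides: the absolute-difference local cost, the three allowed steps, and the name `dtw_distance` are all assumptions made for the example.

```python
def dtw_distance(template, signal):
    """Minimum cumulative cost of warping `signal` onto `template`."""
    n, m = len(template), len(signal)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning template[:i] with signal[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(template[i - 1] - signal[j - 1])  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance both template and input
                cost[i - 1][j],      # advance template only
                cost[i][j - 1],      # advance input only
            )
    return cost[n][m]


template = [0.0, 1.0, 2.0, 1.0, 0.0]
# The same shape at half the tempo: every sample is repeated once.
slow = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 0.0, 0.0]
print(dtw_distance(template, slow))
```

Because the slow version only repeats samples, the warping path can absorb the tempo change and the alignment cost is zero.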
Page 5
Alternative: Sequential modeling
Use a Markov Chain (state machine)
[Figure: state machine with states S1, S2, and S3; each state has a distribution that models the data]
Page 6
Markov Chains (discrete-time & state)
• A Markov chain is a discrete-time, discrete-state
Markov process. The probability of the current random
variable moving to any new state is determined solely by the
current state, and is called a transition probability:
\[ a_{ij} = P(s_i \to s_j) \]
Note: since the transition probabilities are fixed,
there is also a time-invariance assumption.
(This is also false of course, but useful.)
Page 7
Graphical representation
[Figure: fully connected three-state Markov chain with states S1, S2, S3 and transitions a11, a12, a13, a21, a22, a23, a31, a32, a33]
• Markov chain parameters include
  • Transition probability values $a_{ij}$
  • Initial state probabilities $\pi_1, \pi_2, \pi_3$
Page 8
Example: Weather Patterns
Probability of Rain (R), Clouds (C), or Sunshine (S)
modeled as a Markov chain
(rows: current state, columns: next state):
A =       R      C      S
    R    0.7    0.2    0.1
    C    0.4    0.4    0.2
    S    0.1    0.1    0.8
Note: A matrix of this form (square, each row sums to 1) is called a
stochastic matrix.
Page 9
Two-step probabilities
If it’s raining today, what’s the probability of it
raining two days from now?
Need two-step probabilities.
Answer = 0.7*0.7 + 0.2*0.4 + 0.1*0.1 = 0.58
Can also get these directly from $A^2$
(rows: current state, columns: state two days later):
A^2 =      R      C      S
     R    0.58   0.23   0.19
     C    0.46   0.26   0.28
     S    0.19   0.14   0.67
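The two-step calculation sums over every possible intermediate state. A small sketch follows, assuming the weather matrix is stored with rows as the current state and columns as the next state, in the order R, C, S; the function name `two_step` is hypothetical.

```python
# Weather transition matrix: rows = current state, columns = next state,
# in the order Rain, Clouds, Sunshine.
A = [[0.7, 0.2, 0.1],
     [0.4, 0.4, 0.2],
     [0.1, 0.1, 0.8]]
R, C, S = 0, 1, 2


def two_step(A, start, end):
    """P(state `end` two steps from now | state `start` now)."""
    return sum(A[start][mid] * A[mid][end] for mid in range(len(A)))


rain_to_rain = two_step(A, R, R)
print(round(rain_to_rain, 2))  # 0.58
```

Evaluating `two_step` for every pair of states reproduces the entries of the squared matrix.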
Page 10
Steady-state
The N-step probabilities can be obtained from $A^N$,
so A is sufficient to determine the likelihoods of
all possible sequences.
What’s the limiting case? Does it matter if it was
raining 1000 days ago?
A^1000 =     R      C      S
        R   0.4    0.2    0.4
        C   0.4    0.2    0.4
        S   0.4    0.2    0.4
Every row is the same, so the starting state no longer matters.
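The limiting behaviour can be checked numerically by raising the matrix to a large power. The sketch below assumes plain-Python helpers `mat_mul` and `mat_pow` (hypothetical names) and uses repeated squaring; a numerical library would normally be used instead.

```python
def mat_mul(X, Y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(X, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(X)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = X
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Rows = current state, columns = next state, order R, C, S.
A = [[0.7, 0.2, 0.1],
     [0.4, 0.4, 0.2],
     [0.1, 0.1, 0.8]]

A1000 = mat_pow(A, 1000)
for row in A1000:
    print([round(p, 3) for p in row])
```

Each printed row comes out close to [0.4, 0.2, 0.4], whichever state the chain started in.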
Page 11
Probability of state sequence
The probability of any state sequence is given
by:
\[ P(s_1 s_2 s_3 \ldots s_T) = P(s_1)\, P(s_2 \mid s_1)\, P(s_3 \mid s_2) \cdots P(s_T \mid s_{T-1}) = \pi_{s_1}\, a_{s_1 s_2}\, a_{s_2 s_3} \cdots a_{s_{T-1} s_T} \]
Training: Learn the transition probabilities by
keeping count of the state sequences in the
training data.
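Both steps can be sketched in a few lines: scoring a state sequence with the product formula, and estimating transition probabilities from counts. The function names, the initial distribution `pi`, and the handling of unvisited states are assumptions made for this illustration.

```python
def sequence_probability(states, pi, A):
    """P(s_1 .. s_T) = pi[s_1] * product of A[s_(t-1)][s_t]."""
    prob = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= A[prev][cur]
    return prob


def estimate_transitions(sequences, n_states):
    """Estimate transition probabilities by counting observed state pairs."""
    counts = [[0] * n_states for _ in range(n_states)]
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    A = []
    for row in counts:
        total = sum(row)
        # States never left in the data get a uniform row (arbitrary choice here).
        A.append([c / total for c in row] if total else [1.0 / n_states] * n_states)
    return A


R, C, S = 0, 1, 2
A = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2], [0.1, 0.1, 0.8]]
pi = [0.4, 0.2, 0.4]  # assumed initial distribution, not given in the slides
print(sequence_probability([R, R, S], pi, A))
print(estimate_transitions([[0, 0, 1, 0]], n_states=2))
```

The first call multiplies 0.4 * 0.7 * 0.1; the second recovers a two-state matrix from a single short training sequence.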
Page 12
Weather classification
Using a Markov chain for classification:
Train one Markov chain model for each class
Example: a weather transition matrix for each city:
Milwaukee, Phoenix, and Miami
Given a sequence of state observations, identify
which is the most likely city by choosing the model
that gives the highest overall probability.
Page 13
Hidden states & HMMs
• What if you can’t directly observe states?
• But there are measures/observations that relate to the
probability of different states.
• States hidden from view = Hidden Markov Model.
Page 14
General Case HMM
$s_i$ : state i
$a_{ij}$ : $P(s_i \to s_j)$
$o_t$ : output at time t
$b_j(o_t)$ : $P(o_t \mid s_j)$
Initial: $\pi_1, \pi_2, \pi_3$
[Figure: HMM state diagram with output distributions b1(ot), b2(ot), b3(ot), b4(ot) attached to the states]
Page 15
Weather HMM
Extend the weather Markov chain to an HMM
Can’t see if it’s raining, cloudy, or sunny.
But, we can make some observations:
• Humidity H
• Temperature T
• Pressure P
How do we calculate …
• Probability of an observation sequence under a model
How do we learn …
• State transition probabilities for unseen states
• Observation probabilities in each state
Page 16
Observation models
How do we characterize these observations?
Discrete/categorical observations: Learn the probability
mass function directly.
Continuous observations: Assume a parametric
model.
• Our example: Assume a Gaussian distribution
• Need to estimate the mean and variance of the humidity,
temperature and pressure for each state
(3 states x 3 features = 9 means and 9 variances, for each city model)
Page 17
HMM classification
Using an HMM for classification:
Training: One HMM for each class
• Transition matrix plus state means and variances
(9 + 9 + 9 = 27 parameters) for each city
Classification: Given a sequence of observations:
• Evaluate P(O|model) for each city
(Much harder to compute for an HMM than for a Markov chain)
• Choose the model that gives the highest overall probability.
Page 18
Using for Speech Recognition
States represent beginning, middle, end of a phoneme
[Figure: five-state left-to-right HMM. S1 is the start state and S5 the end state; S2, S3, S4 have output distributions b2(•), b3(•), b4(•). Transitions: a12, a13, a22, a23, a24, a33, a34, a35, a44, a45.]
Gaussian Mixture Model in each state
Page 19
Fundamental HMM Computations
Evaluation: Given a model $\lambda$ and an
observation sequence $O = (o_1, o_2, \ldots, o_T)$,
compute $P(O \mid \lambda)$.
Alignment: Given $\lambda$ and O, compute the
‘correct’ state sequence $S = (s_1, s_2, \ldots, s_T)$, such
as $S^* = \arg\max_S P(S \mid O, \lambda)$.
Training: Given a group of observation
sequences, find an estimate of $\lambda$, such as
$\lambda_{ML} = \arg\max_{\lambda} P(O \mid \lambda)$.
Page 20
Evaluation: Forward/Backward algorithm
Define $\alpha_i(t) = P(o_1 o_2 \ldots o_t,\ s_t = i \mid \lambda)$
Define $\beta_i(t) = P(o_{t+1} o_{t+2} \ldots o_T \mid s_t = i,\ \lambda)$
Each of these can be implemented efficiently via
dynamic programming recursions starting at
t=1 (for $\alpha$) and t=T (for $\beta$).
By putting the forward & backward together:
\[ \alpha_i(t)\, \beta_i(t) = P(o_1 \ldots o_t \ldots o_T,\ s_t = i \mid \lambda) \]
\[ P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t) \quad \text{(for any } t\text{)} \]
Page 21
Forward Recursion
1. Initialization
\[ \alpha_i(1) = \pi_i\, b_i(o_1), \quad \forall i \in \{1..N\} \]
2. Recursion
\[ \alpha_j(t+1) = \left[ \sum_{i=1}^{N} \alpha_i(t)\, a_{ij} \right] b_j(o_{t+1}), \quad \forall j \in \{1..N\},\ t \in \{1..T-1\} \]
3. Termination
\[ P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(T) \]
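The three steps map directly onto a short loop. The sketch below assumes a discrete-output HMM stored as plain lists (`pi`, `A`, and an emission table `B[state][symbol]`), and borrows its numbers from the genie example worked later in these slides; the function name and the symbol coding are assumptions.

```python
def forward(pi, A, B, obs):
    """Return (alpha, P(O | model)) for a discrete-output HMM.

    alpha[t][i] uses 0-based time, so it corresponds to alpha_i(t + 1)
    in the slides' notation.
    """
    N = len(pi)
    # 1. Initialization
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    # 2. Recursion
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    # 3. Termination
    return alpha, sum(alpha[-1])


# Genie example: three baskets; symbols coded 0 = blue, 1 = red, 2 = green.
pi = [0.25, 0.25, 0.5]
A = [pi[:] for _ in range(3)]  # the next basket does not depend on the current one
B = [[0.5, 0.5, 0.0],
     [0.2, 0.2, 0.6],
     [0.0, 0.5, 0.5]]
alpha, prob = forward(pi, A, B, [0, 1])  # observation sequence {blue, red}
print(prob)  # approximately 0.074375
```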
Page 22
Backward recursion
1. Initialization
\[ \beta_i(T) = 1, \quad \forall i \in \{1..N\} \]
2. Recursion
\[ \beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1), \quad \forall i \in \{1..N\},\ t \in \{(T-1)..1\} \]
3. Termination
\[ P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_i(1) \]
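The backward pass mirrors the forward one, running from the last frame to the first. As before, this is a sketch under assumed data structures (plain lists, emission table `B[state][symbol]`), checked against the genie example from later in these slides.

```python
def backward(pi, A, B, obs):
    """Return (beta, P(O | model)) for a discrete-output HMM.

    beta[t][i] uses 0-based time, so it corresponds to beta_i(t + 1)
    in the slides' notation.
    """
    N, T = len(pi), len(obs)
    # 1. Initialization: beta is 1 at the final frame
    beta = [[1.0] * N for _ in range(T)]
    # 2. Recursion, from the second-to-last frame back to the first
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                       for j in range(N))
                   for i in range(N)]
    # 3. Termination
    prob = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(N))
    return beta, prob


# Genie example: symbols coded 0 = blue, 1 = red, 2 = green.
pi = [0.25, 0.25, 0.5]
A = [pi[:] for _ in range(3)]
B = [[0.5, 0.5, 0.0],
     [0.2, 0.2, 0.6],
     [0.0, 0.5, 0.5]]
beta, prob = backward(pi, A, B, [0, 1])
print(prob)  # approximately 0.074375, matching the forward pass
```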
Page 23
Note: Computation improvement
Direct computation: $P(O \mid \lambda)$ = the sum of the
observation probabilities over all possible state
sequences, of which there are $N^T$.
Time complexity = $O(T N^T)$
F/B algorithm: For each state at each time step
do a summation over all state values from the
previous time step:
Time complexity = $O(T N^2)$
Page 24
From $\alpha_i(t)$ and $\beta_i(t)$:
One-state occupancy probability
\[ \gamma_t(i) = \frac{\alpha_i(t)\, \beta_i(t)}{P(O \mid \lambda)} = \frac{\alpha_i(t)\, \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\, \beta_j(t)} \]
Two-state occupancy probability
\[ \xi_t(i,j) = \frac{\alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)}{P(O \mid \lambda)} = \frac{\alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)}{\sum_{k=1}^{N} \sum_{l=1}^{N} \alpha_k(t)\, a_{kl}\, b_l(o_{t+1})\, \beta_l(t+1)} \]
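A sketch of how the two occupancy probabilities follow from the forward and backward variables. The helper names are hypothetical, the model is assumed to be a discrete-output HMM in plain lists, and P(O | model) is assumed to be nonzero.

```python
def forward_backward(pi, A, B, obs):
    """Return (alpha, beta, P(O | model)) with 0-based time indices."""
    N, T = len(pi), len(obs)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
                   for i in range(N)]
    return alpha, beta, sum(alpha[-1])


def occupancies(pi, A, B, obs):
    """One-state (gamma) and two-state (xi) occupancy probabilities."""
    N, T = len(pi), len(obs)
    alpha, beta, prob = forward_backward(pi, A, B, obs)
    gamma = [[alpha[t][i] * beta[t][i] / prob for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / prob
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    return gamma, xi


# Genie example from later in these slides: 0 = blue, 1 = red, 2 = green.
pi = [0.25, 0.25, 0.5]
A = [pi[:] for _ in range(3)]
B = [[0.5, 0.5, 0.0], [0.2, 0.2, 0.6], [0.0, 0.5, 0.5]]
gamma, xi = occupancies(pi, A, B, [0, 1])
print([round(g, 3) for g in gamma[0]])
```

Summing `xi[t][i][j]` over j gives back `gamma[t][i]`, which is a convenient sanity check on an implementation.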
Page 25
Alignment: Viterbi algorithm
To find the single most likely state sequence S,
use the Viterbi dynamic programming algorithm:
1. Initialization:
\[ \delta_i(1) = \pi_i\, b_i(o_1), \quad \forall i \in \{1..N\} \]
2. Recursion:
\[ \delta_j(t) = \max_i \left[ \delta_i(t-1)\, a_{ij} \right] b_j(o_t) \]
3. Termination:
\[ P_{max}(S, O \mid \lambda) = \max_i \delta_i(T) \]
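A sketch of the recursion with one addition that the slide leaves implicit: back-pointers, so the best state sequence itself can be recovered after termination. The function name and list-based model layout are assumptions; the numbers come from the genie example later in these slides.

```python
def viterbi(pi, A, B, obs):
    """Return (best state path, its joint probability with the observations)."""
    N, T = len(pi), len(obs)
    # 1. Initialization
    score = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    backptr = []
    # 2. Recursion: keep the best predecessor of every state at every frame
    for t in range(1, T):
        row, ptr = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: score[t - 1][i] * A[i][j])
            row.append(score[t - 1][best_i] * A[best_i][j] * B[j][obs[t]])
            ptr.append(best_i)
        score.append(row)
        backptr.append(ptr)
    # 3. Termination, then trace the back-pointers to recover the path
    last = max(range(N), key=lambda i: score[-1][i])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[-1][last]


# Genie example: states 0, 1, 2 are baskets one, two, three;
# symbols coded 0 = blue, 1 = red, 2 = green.
pi = [0.25, 0.25, 0.5]
A = [pi[:] for _ in range(3)]
B = [[0.5, 0.5, 0.0], [0.2, 0.2, 0.6], [0.0, 0.5, 0.5]]
path, prob = viterbi(pi, A, B, [0, 1])
print(path, prob)  # [0, 2] 0.03125
```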
Page 26
Training
We need to learn the parameters of the model,
given the training data.
Possibilities include:
• Maximum a Posteriori (MAP)
\[ \hat{\lambda} = \arg\max_{\lambda} P(\lambda \mid O) \]
• Maximum Likelihood (ML)
\[ \hat{\lambda} = \arg\max_{\lambda} P(O \mid \lambda) \]
• Minimum Error Rate
\[ \hat{\lambda} = \arg\min_{\lambda} \text{(error rate over training data)} \]
Page 27
Expectation Maximization
Expectation Maximization (EM) can be used for
ML estimation of parameters in the presence
of hidden variables.
Basic iterative process:
1. Compute the state sequence likelihoods given
current parameters
2. Estimate new parameter values given the state
sequence likelihoods.
Page 28
EM Training: Baum-Welch
for Discrete Observations
(e.g. VQ coded)
Basic Idea: Using the current $\lambda$ and the F/B equations,
compute state occupation probabilities. Then,
compute new values:
\[ a'_{ij} = \frac{E\{\text{number of transitions from } i \text{ to } j\}}{E\{\text{number of transitions from } i\}} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \]
\[ b'_i(o_t) = \frac{E\{\text{number of observations of } o_t \text{ in } i\}}{E\{\text{number of times in } i\}} = \frac{\sum_{\tau=1,\ o_\tau = o_t}^{T} \gamma_\tau(i)}{\sum_{\tau=1}^{T} \gamma_\tau(i)} \]
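The re-estimation step can be sketched as a function of the occupancy probabilities. This illustration assumes `gamma[t][i]` and `xi[t][i][j]` have already been computed from the forward/backward pass, that every state has nonzero occupancy, and that emission counts are accumulated over all T frames; the function name is hypothetical.

```python
def reestimate_discrete(gamma, xi, obs, n_symbols):
    """One Baum-Welch update of the transition and emission tables.

    gamma[t][i]: one-state occupancy, t = 0 .. T-1
    xi[t][i][j]: two-state occupancy, t = 0 .. T-2
    obs[t]:      index of the symbol observed at frame t
    """
    T, N = len(gamma), len(gamma[0])
    new_A = []
    for i in range(N):
        # Expected number of transitions out of state i
        denom = sum(gamma[t][i] for t in range(T - 1))
        new_A.append([sum(xi[t][i][j] for t in range(T - 1)) / denom
                      for j in range(N)])
    new_B = []
    for i in range(N):
        # Expected number of frames spent in state i
        denom = sum(gamma[t][i] for t in range(T))
        new_B.append([sum(gamma[t][i] for t in range(T) if obs[t] == k) / denom
                      for k in range(n_symbols)])
    return new_A, new_B


# Hand-made occupancies for a 2-state model and 3 frames (illustrative only).
gamma = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
xi = [[[0.5, 0.5], [0.0, 0.0]],
      [[0.0, 0.5], [0.0, 0.5]]]
new_A, new_B = reestimate_discrete(gamma, xi, obs=[0, 1, 0], n_symbols=2)
print(new_A)
print(new_B)
```

In a full training loop this update alternates with recomputing the occupancies under the new parameters until the likelihood stops improving.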
Page 29
Update equations for Gaussian distributions:
\[ \hat{\mu}_i = \frac{\sum_{t=1}^{T} P(s_i \mid o_t)\, o_t}{\sum_{t=1}^{T} P(s_i \mid o_t)} \]
\[ \hat{\Sigma}_i = \frac{\sum_{t=1}^{T} P(s_i \mid o_t)\, (o_t - \hat{\mu}_i)(o_t - \hat{\mu}_i)^{\top}}{\sum_{t=1}^{T} P(s_i \mid o_t)} \]
GMMs are similar, but need to incorporate
mixture likelihoods as well as state likelihoods
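For a single state, the Gaussian updates are occupancy-weighted averages. The sketch below assumes a diagonal covariance (one variance per feature) rather than the full matrix, and takes the per-frame state probabilities as given; the function name is hypothetical.

```python
def reestimate_gaussian(state_probs, observations):
    """Occupancy-weighted mean and per-feature variance for one state.

    state_probs[t]:  P(s_i | o_t) for the state being updated
    observations[t]: feature vector at frame t
    """
    dim = len(observations[0])
    total = sum(state_probs)
    mean = [sum(p * o[d] for p, o in zip(state_probs, observations)) / total
            for d in range(dim)]
    # Diagonal covariance only: variance of each feature around the new mean
    var = [sum(p * (o[d] - mean[d]) ** 2 for p, o in zip(state_probs, observations)) / total
           for d in range(dim)]
    return mean, var


# Illustrative data: the third frame has zero occupancy, so it is ignored.
state_probs = [1.0, 1.0, 0.0]
observations = [[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]]
mean, var = reestimate_gaussian(state_probs, observations)
print(mean, var)  # [2.0, 4.0] [1.0, 4.0]
```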
Page 30
Toy example: Genie and the urns
• There are N urns in a nearby room; each contains many
balls of M different colors.
• A genie picks out a sequence of balls from the urns and
shows you the result. Can you determine the sequence
of urns they came from?
• Model as HMM: N states, M outputs
  • the probabilities of picking from each urn are the state transitions
  • the proportions of different colored balls in each urn make up the
probability mass function for each state.
Page 31
Working out the Genie example
There are three baskets of colored balls
• Basket one: 10 blue and 10 red
• Basket two: 15 green, 5 blue, and 5 red
• Basket three: 10 green and 10 red
The genie chooses from baskets at random
• 25% chance of picking from basket one, and 25% from basket two
• 50% chance of picking from basket three
Page 32
Genie Example Diagram
Page 33
Two Questions
Assume that the genie reports a sequence of
two balls as {blue, red}.
Answer two questions:
What is the probability that a two ball sequence will
be {blue, red}?
What is the most likely sequence of baskets to
produce the sequence {blue, red}?
Page 34
Probability of {blue, red}
for Specific Basket Sequence
\[ P(O,\ s_1 = i,\ s_2 = j \mid \lambda) = \pi_i\, b_i(\text{blue})\, a_{ij}\, b_j(\text{red}) = p(i)\, b_i(\text{blue})\, p(j)\, b_j(\text{red}) \]
First \ Second    Basket One   Basket Two   Basket Three
Basket One        0.015625     0.00625      0.03125
Basket Two        0.00625      0.00250      0.01250
Basket Three      0.0          0.0          0.0
Page 35
Probability of {blue,red}
What is the total probability of {blue,red}?
• Sum(matrix values) = 0.074375
What is the most likely sequence of baskets visited?
• Argmax(matrix values) = {Basket 1, Basket 3}
• Corresponding max likelihood = 0.03125
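The table, its sum, and its largest entry can be reproduced by brute-force enumeration of the nine basket pairs. In this sketch the emission probabilities are derived from the ball counts given for each basket; the variable names are assumptions.

```python
from itertools import product

# Ball counts per basket, as stated in the example.
counts = [
    {"blue": 10, "red": 10},
    {"green": 15, "blue": 5, "red": 5},
    {"green": 10, "red": 10},
]
pick = [0.25, 0.25, 0.5]  # chance of choosing basket one, two, three


def emit(basket, color):
    """Probability of drawing `color` from `basket` (0-based index)."""
    balls = counts[basket]
    return balls.get(color, 0) / sum(balls.values())


# Joint probability of each (first basket, second basket) pair with {blue, red}.
joint = {
    (i, j): pick[i] * emit(i, "blue") * pick[j] * emit(j, "red")
    for i, j in product(range(3), repeat=2)
}
total = sum(joint.values())
best = max(joint, key=joint.get)
print(total)              # approximately 0.074375
print(best, joint[best])  # (0, 2) 0.03125, i.e. basket one then basket three
```

Enumeration is only practical here because there are two draws and three baskets; the number of sequences grows exponentially with the sequence length.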
Page 36
Viterbi method
\[ \delta_1(1) = \pi_1 b_1(o_1) = (.25)(.5) = 0.125 \]
\[ \delta_2(1) = \pi_2 b_2(o_1) = (.25)(.2) = 0.05 \]
\[ \delta_3(1) = \pi_3 b_3(o_1) = (.5)(0) = 0 \]
\[ \delta_1(2) = \max\{(.125)(.25)(.5),\ (.05)(.25)(.5),\ 0\} = 0.015625 \]
\[ \delta_2(2) = \max\{(.125)(.25)(.2),\ (.05)(.25)(.2),\ 0\} = 0.00625 \]
\[ \delta_3(2) = \max\{(.125)(.5)(.5),\ (.05)(.5)(.5),\ 0\} = 0.03125 \]
Best path ends in state 3, coming previously from state 1.
Page 37
Composite Models
• Training data is at sentence level, generally not
annotated at sub-word (HMM model) level.
• Need to be able to form composite models from a
sequence of word or phoneme labels.
[Figure: two five-state models placed in sequence, each with S1 (start state), S2, S3, S4, and S5 (end state), and transitions a12, a13, a22, a23, a24, a33, a34, a35, a44, a45]
Page 38
Viterbi and Token Passing
[Figure: diagram with three labeled parts: Recognition Network (models a, b, c, d, ..., z), Word Graph, and Best Sentence]
Page 39
HMM Notation
Discrete HMM Case:
$\lambda$   The set of all parameters for one HMM
N   Number of states in a model
Q   Set of possible states
M   Number of output symbols
V   Set of possible outputs
T   Number of observations in the observation sequence
O   Sequence of observations {o1 .. oT}
B   Output matrix, NxM, with row i = output distribution for state i
A   State transition matrix, NxN
$\pi$   Initial probability vector, length N
Page 40
Continuous HMM Case:
$\lambda$   The set of all parameters for one HMM
N   Number of states in a model
Q   Set of possible states
T   Number of observations in the observation sequence
O   Sequence of observations {o1 .. oT}
$b_i(\cdot)$   Output distribution for state i, $= N(\mu_i, \sigma_i)$ (diagonal covariance matrix)
$\mu_i$   Vector of mean values for state i
$\sigma_i$   Vector of standard deviation values for state i
A   State transition matrix, NxN
$\pi$   Initial probability vector, length N
Page 41
Multi-mixture, multi-observation case:
$\lambda$   The set of all parameters for one HMM
N   Number of states in current model
M   Number of mixtures in output distribution in current model
R   Number of sentences in training set
T   Number of observations in current sentence
Q   Number of models in current (training) sentence label
$N_q$   Number of states in model q
$T_r$   Number of observations in sentence r
O   The sequence of observations in current sentence
$o_t$   The observation vector at time t
S   The sequence of states
$s_t$   The state at time t
$s_t^{(q)}$   The state of model q at time t
$a_{ij}$   The probability of a transition from state i to j in current model
$\pi_i$   The initial probability of being in state i
$b_j(o_t)$   The observation output probability in state j of current model
$c_{jm}$   The mixture weight for mixture m in state j of current model
$b_{jm}(o_t)$   The observation output probability for mixture m in state j of current model
$\mu_{jm}$   The mean vector for mixture component m in state j
$\Sigma_{jm}$   The covariance matrix for mixture component m in state j
Page 42