Foundations of ASR
Steven Wegmann
Director Speech Research
International Computer Science Institute
29 June 2015
Outline
The main idea in the ASR formalism
High level view of HMMs and why they are used in ASR
HMM details and inference algorithms
ASR model training and decoding details
The main idea in the ASR formalism
Some notation:
- W = sequence of words (transcription)
- X = acoustic observations or evidence (MFCCs, ∆s, ∆∆s)
Given a particular acoustic utterance, x, our goal is to compute, for every possible transcription w, the probability
P(W = w | X = x)
We use these probabilities to decode or recognize the utterance x by selecting the most likely transcription w_recog via:
w_recog = argmax_w P(W = w | X = x)
The main idea (cont’d)
We use Bayes' Theorem
P(w | x) = P(x | w) P(w) / P(x)
This decomposes the problem into two probability models
- The acoustic model (AM) gives P(x | w)
- The language model (LM) gives P(w)
- The term P(x) is constant so it is (usually) ignored
The main idea (cont’d)
The mainstream choices for these models
- AM: hidden Markov model (HMM)
- LM: smoothed n-gram model (earlier lecture)
We can view P(w | x) as the posterior on word sequences given the acoustic observations
- Where P(w) is the prior on word sequences
- P(x | w) is the likelihood of x given w
Note that we are not modeling P(w | x) directly
- Why not?
Generative vs Discriminative classifiers
What we've just described is an example of a generative classifier
- Model P(X | W = w) separately for each class W = w
- X is random, W = w is given
- Stronger model assumptions
- Uses maximum likelihood estimation
- Estimation is "easy"
A discriminative classifier models P(W | X = x) directly
- W is random, X = x is given
- Weaker model assumptions
- Uses conditional maximum likelihood estimation
- Estimation is "hard"
Our acoustic models are hidden Markov models
We want a probability model for a sequence of frames given a particular word sequence, i.e.
P(x | w)
In the last 25 years, hidden Markov models (HMMs) have provided the most successful form for this model
This is in spite of the fact that they make very strong independence assumptions which are clearly violated in the case of speech data
Notational quibbles
In this talk X is always a continuous random variable
- So P(X = x | w) = P(x | w) is a probability density function
- I would normally write f(x | w) to emphasize this
- In this talk I will use P for notational simplicity
- It's bad because it makes densities look like probabilities: sorry!
In practice computations are done in the log domain
- So log P(X = x | w) instead of P(X = x | w)
- This is because most of the quantities are very very small
  - Lots of multiplies of small quantities
  - In the log domain these become sums
- However I'm leaving out the logs for clarity
Roughly speaking, what is an HMM?
An HMM consists of two synchronized stochastic processes
- An unobservable Markov chain of hidden states St
- An observed process Xt: in our case the acoustic observations
Each St is a discrete random variable, while the Xt can be either discrete or continuous random variables
The hidden chain 'explains' the observed process, because each St emits Xt
- Using a state-dependent probability distribution
Why use HMMs for speech recognition?
We will sketch two motivations for HMMs
- In particular: the hidden state sequence
- To help build some intuition about HMMs
The first: based on the physiology of speech production
The second: based on properties of the model
The spectrogram serves both motivations
- It is essentially what we observe
Spectrograms
A spectrogram is a particularly useful way to display speech data
It displays a sequence of spectral envelopes computed from the signal
- Slide a 25ms analysis window in 10ms frame steps
- Apply a Hamming window
- Compute the FFT in the window to get the energy at each frequency
The usual front end (MFCCs) starts with this same processing
- It, and most other front ends, are transformations of this picture
[Figure: a spectrogram (from MIT 6.345, Spring '03)]
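As a rough illustration of this front-end processing, here is a minimal numpy sketch; the `signal` array and `sample_rate` are assumed inputs (not part of the slides), and the constants match the 25 ms / 10 ms framing described above.

```python
# A minimal sketch of the spectrogram computation described above, assuming
# a 1-D numpy array `signal` sampled at `sample_rate` Hz (hypothetical inputs).
import numpy as np

def spectrogram(signal, sample_rate, win_ms=25.0, step_ms=10.0):
    win = int(sample_rate * win_ms / 1000)    # 25 ms analysis window
    step = int(sample_rate * step_ms / 1000)  # 10 ms frame step
    hamming = np.hamming(win)
    frames = []
    for start in range(0, len(signal) - win + 1, step):
        frame = signal[start:start + win] * hamming   # window the frame
        spectrum = np.abs(np.fft.rfft(frame)) ** 2    # energy at each frequency
        frames.append(spectrum)
    # Rows are frames, columns are frequency bins; log scale for display
    return 10.0 * np.log10(np.array(frames) + 1e-10)
```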
Physiological motivation: picture
Physiological motivation: idea
When we speak a sentence our production mechanism traverses an essentially hidden sequence of configurations involving our
- Glottis
- Vocal tract
- Tongue
- Lips ...
View the HMM's hidden state sequence as a discrete approximation to these continuous sequences of configurations
Spectrograms give us hope that relatively few states are necessary for a reasonable approximation
Second motivation: HMM’s point of view
HMMs were designed to solve the following 'segmentation' problem
- We observe a slowly varying stochastic process produced by a hidden group of states
- We know that the hidden states emit in bunches
- The task is to identify which states emitted the observed data
Spectrograms give us hope again
The spectrogram segmentation problem
Historical notes
L. Baum and colleagues at the IDA in the 60's
- Developed HMMs to solve problems in cryptography
- Jim Baker was an intern at IDA while a Princeton undergrad
- Jim went to CMU for his PhD, where he applied HMMs to ASR (1970's)
- Jim eventually cofounded Dragon Systems
- Baum and Simons (a colleague) both went on to found early hedge funds
Fred Jelinek independently applied HMMs to ASR at IBM
- First used HMMs in the 60's to solve a problem involving hard drive read errors
- May have used the work of Stratonovich?
- Stratonovich developed many HMM properties in the late 50's but published them in Russian
HMM Details: discrete time stochastic process
A discrete time stochastic process is a sequence of random variables
- (X0, X1, ..., Xt, ...)
- Where all the Xt share the same sample space
- The Xt may be vectors
In ASR there are canonical stochastic processes
- All related to spoken utterances
- The observed sequence of speech frames: (Xt)
- The corresponding hidden sequence of HMM states: (St)
- The sequence of words: (Wi)
HMM Details: Markov chains
A Markov chain is a discrete time stochastic process (St)
- The St are discrete random variables satisfying
  P(St+1 = st+1 | St = st, ..., S0 = s0) = P(St+1 = st+1 | St = st)
- Called the (first order) Markov assumption
For a Markov chain: the future depends on the past only through the present!
We also assume that the St take on a finite set of values
- I.e., the sample space S is finite
- Abuse notation and write S = |S|
- So S = {1, ..., S}
HMM Details: stationary Markov chains
A Markov chain is called stationary if
P(St+1 | St) = P(St | St−1) = · · · = P(S1 | S0)
The transition probabilities define an S × S matrix A:
A_ij = P(St+1 = j | St = i) ∀t ≥ 0, ∀i, j ∈ S
There is also an initial distribution π:
π(i) = P(S0 = i) ∀i ∈ S
Most HMMs in ASR use this Markov chain
[Diagram: a left-to-right chain 0 → 1 → 2 → 3 → 4, with transitions A01, A12, A23, A34 and self-loops A11, A22, A33]
There are 5 states
- State 0 is the initial state
- State 4 is the final (accepting) state
- π(0) = 1 and A01 = 1
At state 1 there are only two possible transitions
- Stay put with probability A11
- Advance to state 2 with probability A12
- A11 + A12 = 1
HMM definition
A hidden Markov model consists of
- An observed stochastic process X0, ..., XT
- A hidden stationary Markov chain S0, ..., ST
- Their joint probability distribution P(x0, ..., xT, s0, ..., sT)
Satisfying these (strong!) assumptions
- Conditional independence:
  - Given S0, ..., ST, the Xi are independent
  - Given Si, Xi is independent of Sj when j ≠ i
- Stationarity: the distribution of Xi given Si does not depend on i
  - These are called the state's output distributions
  - One distribution per state: {P(x | s)}_{s=1}^{S}
  - Not specified by the model, i.e., we choose
Simple toy example using an isolated word
Use this HMM to model the word "COW"
[Diagram: 0 →(1.0)→ C →(0.5)→ O →(0.2)→ W →(0.9)→ 4, with self-loop probabilities C: 0.5, O: 0.8, W: 0.1]
States 0 (start) and 4 (accept) do not emit frames (think FSA)
States C, O, W emit frames according to the (as yet unspecified) distributions
- P(xt | st = C), P(xt | st = O), and P(xt | st = W)
Toy example (cont’d)
[Diagram: the same HMM for "COW" as above]
Suppose we have a 5-frame utterance of COW
- (x0, x1, x2, x3, x4)
What are the allowable state sequences and their probabilities?
- CCCOW
- CCOOW
- CCOWW
- COOOW
- COOWW
- COWWW
Toy example (cont'd)
[Diagram: the same HMM for "COW" as above]
For the probability of CCCOW multiply the following:
- 1.0 × P(x0 | C)
- 0.5 × P(x1 | C)
- 0.5 × P(x2 | C)
- 0.5 × P(x3 | O)
- 0.2 × P(x4 | W) × 0.9
- You do the rest!
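As a concrete check, the sketch below multiplies out a path probability for this toy HMM. The emission densities are left as a hypothetical function `emit(state, x)`, since the slides have not yet specified them; the transition probabilities come from the "COW" diagram above.

```python
# A small sketch of the CCCOW path probability above; emit(state, x) is a
# hypothetical emission density P(x | state).
trans = {('0', 'C'): 1.0, ('C', 'C'): 0.5, ('C', 'O'): 0.5,
         ('O', 'O'): 0.8, ('O', 'W'): 0.2, ('W', 'W'): 0.1, ('W', '4'): 0.9}

def path_probability(path, frames, emit):
    """path: e.g. list('CCCOW'); frames: (x0, ..., x4); emit(state, x) -> density."""
    prob, prev = 1.0, '0'                      # start in the non-emitting state 0
    for state, x in zip(path, frames):
        prob *= trans[(prev, state)] * emit(state, x)
        prev = state
    return prob * trans[(prev, '4')]           # exit to the accepting state 4
```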
Toy example (cont'd)
[Diagram: the generic left-to-right chain 0 → 1 → 2 → 3 → 4 with self-loops A11, A22, A33]
In principle there are 3^T possible state sequences for a length-T utterance with this HMM
- Don't count the non-emitting start and accept states
- In reality most of those sequences are not allowed: zero probability
- Exercise: the number of allowable state sequences is
  Σ_{k=1}^{T−2} k = (T − 2)(T − 1) / 2
HMM inference problems/algorithms
The joint distribution
P(x0, ..., xT, s0, ..., sT)
The marginal distribution (forward algorithm)
P(x0, ..., xT) = Σ_{s0,...,sT} P(x0, ..., xT, s0, ..., sT)
The conditional distributions (forward-backward algorithm)
P(st | x0, ..., xT)
The maximum likelihood state sequence (Viterbi algorithm)
ŝ_ML = argmax_{s0,...,sT} P(x0, ..., xT, s0, ..., sT)
A major reason for the HMM's success
There are efficient (linear in T) algorithms for inference
Recursive factorization and dynamic programming are key to these algorithms
Strong model assumptions allow major simplifications
Make repeated use of the chain rule for probabilities
P(A, B, C) = P(C | A, B) P(A, B) = P(C | A, B) P(B | A) P(A)
HMM's joint distribution
First simple decomposition
P(x0, ..., xT, s0, ..., sT) = P(x0, ..., xT | s0, ..., sT) P(s0, ..., sT)
Use recursive factorization on the Markov chain
P(s0, ..., sT) = P(sT | s0, ..., sT−1) P(s0, ..., sT−1)
             = P(sT | s0, ..., sT−1) P(sT−1 | sT−2, ..., s0) · · · P(s1 | s0) P(s0)
             = P(sT | sT−1) P(sT−1 | sT−2) · · · P(s1 | s0) P(s0)
             = A_{sT−1 sT} A_{sT−2 sT−1} · · · A_{s0 s1} π(s0)
             = π(s0) ∏_{t=0}^{T−1} A_{st st+1}
HMM's joint distribution (cont'd)
Apply conditional independence assumptions
P(x0, ..., xT | s0, ..., sT) = ∏_{t=0}^{T} P(xt | s0, ..., sT)
                            = ∏_{t=0}^{T} P(xt | st)
Putting it all together
P(x0, ..., xT, s0, ..., sT) = π(s0) ∏_{t=0}^{T−1} A_{st st+1} ∏_{t=0}^{T} P(xt | st)
HMM's marginal distribution
Sum over all possible state sequences
P(x0, ..., xT) = Σ_{(s0,...,sT) ∈ S×···×S} π(s0) ∏_{t=0}^{T−1} A_{st st+1} ∏_{t=0}^{T} P(xt | st)
It is not feasible to sum over all state sequences
- S^{T+1} state sequences!
- Instead, computation is done using the forward algorithm
The forward algorithm
Define the forward probabilities
α_t(s_t) = P(x0, ..., xt, st)
Note that
P(x0, ..., xT) = Σ_{sT=1}^{S} P(x0, ..., xT, sT) = Σ_{sT=1}^{S} α_T(sT)
We will derive the recursion
α_t(st) = P(xt | st) Σ_{st−1=1}^{S} A_{st−1 st} α_{t−1}(st−1)
The forward algorithm: dynamic programming
To compute P(x0, ..., xT)
- Initialize: α_0(s0) = P(x0, s0) = π(s0) P(x0 | s0) ∀s0 ∈ S
- At t: compute α_t(st) ∀st ∈ S using the results from t − 1 via
  α_t(st) = P(xt | st) Σ_{st−1=1}^{S} A_{st−1 st} α_{t−1}(st−1)
- Terminate: at T
  P(x0, ..., xT) = Σ_{sT=1}^{S} α_T(sT)
Dynamic programming reduces exponential time to linear time
- O(S^T) → O(T S^2)
- For HMMs used in ASR: time is actually O(T S)
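The recursion translates directly into a few lines of code. Below is a minimal numpy sketch, assuming `pi` is the vector of initial probabilities, `A` the transition matrix, and `B[t, s]` the pre-computed output densities P(x_t | s); as noted in the notational quibbles, a real implementation would work in the log domain to avoid underflow.

```python
# A minimal sketch of the forward algorithm under the assumed pi, A, B conventions.
import numpy as np

def forward(pi, A, B):
    T, S = B.shape
    alpha = np.zeros((T, S))
    alpha[0] = pi * B[0]                       # alpha_0(s) = pi(s) P(x_0 | s)
    for t in range(1, T):
        alpha[t] = B[t] * (alpha[t - 1] @ A)   # alpha_t(s) = P(x_t|s) sum_r A[r,s] alpha_{t-1}(r)
    return alpha, alpha[-1].sum()              # forward table and P(x_0, ..., x_T)
```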
The forward algorithm: derivation
The chain rule, conditional independence, and transition matrix A give
α_t(st) = P(x0, ..., xt, st)
        = Σ_{st−1=1}^{S} P(x0, ..., xt, st, st−1)
        = Σ_{st−1=1}^{S} P(xt | x0, ..., xt−1, st, st−1) P(st | x0, ..., xt−1, st−1) P(x0, ..., xt−1, st−1)
        = P(xt | st) Σ_{st−1=1}^{S} P(st | st−1) α_{t−1}(st−1)
        = P(xt | st) Σ_{st−1=1}^{S} A_{st−1 st} α_{t−1}(st−1)
Toy example again
[Diagram: the HMM for "COW" as above]
Exercise: work out the forward algorithm using a 5-frame utterance of COW
- (x0, x1, x2, x3, x4)
Toy example (cont'd)
Related exercise: show that the allowable paths form a trellis
[Diagram: a trellis over frames x0, ..., x4 with rows for states C, O, W, leading from the start state 0 to the accepting state 4]
The forward-backward algorithm
We want to compute
P(st | x0, ..., xT) ∀t
Introducing the backward probabilities
β_t(st) = P(xt+1, ..., xT | st)
Exercise: these satisfy the recursion
β_t(st) = Σ_{st+1=1}^{S} P(xt+1 | st+1) A_{st st+1} β_{t+1}(st+1)
The forward-backward algorithm (cont'd)
The independence assumptions imply
γ_t(st) ≡ P(x0, ..., xt, xt+1, ..., xT, st)
        = P(x0, ..., xt, st) P(xt+1, ..., xT | x0, ..., xt, st)
        = P(x0, ..., xt, st) P(xt+1, ..., xT | st)
        = α_t(st) β_t(st)
Thus
P(st | x0, ..., xT) = P(x0, ..., xT, st) / P(x0, ..., xT) = γ_t(st) / Σ_{s=1}^{S} γ_t(s)
The forward-backward algorithm: dynamic programming
To compute P(st | x0, ..., xT)
- Run the forward algorithm
- Initialize: β_T(s) = 1 ∀s ∈ S
- At t: compute β_t(st) ∀st ∈ S using the results from t + 1 via
  β_t(st) = Σ_{st+1=1}^{S} P(xt+1 | st+1) A_{st st+1} β_{t+1}(st+1)
- Terminate at t = 0; then set γ_t(st) = α_t(st) β_t(st) and, ∀t,
  P(st | x0, ..., xT) = γ_t(st) / Σ_{s=1}^{S} γ_t(s)
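As a companion to the forward sketch above, here is a minimal numpy sketch of the backward recursion and the resulting state posteriors, under the same (assumed) `pi`, `A`, `B` conventions and reusing the `forward` function from that sketch.

```python
# A sketch of the backward pass and the state posteriors P(s_t | x_0, ..., x_T).
import numpy as np

def state_posteriors(pi, A, B):
    T, S = B.shape
    alpha, _ = forward(pi, A, B)                     # forward table from the earlier sketch
    beta = np.zeros((T, S))
    beta[-1] = 1.0                                   # beta_T(s) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])       # backward recursion
    gamma = alpha * beta                             # gamma_t(s) = alpha_t(s) beta_t(s)
    return gamma / gamma.sum(axis=1, keepdims=True)  # normalize over states at each t
```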
The Viterbi algorithm: brief review
The goal is to find the maximum likelihood state sequence
ŝ_ML = argmax_{s0,...,sT} P(x0, ..., xT, s0, ..., sT)
The Viterbi algorithm is
- Used for decoding/recognition
- Used for model parameter estimation
- A modified version of the forward algorithm
  - Sums are replaced by maxes: the "Viterbi approximation"
  - An extra term is introduced to store the "back-trace"
The Viterbi algorithm review (cont'd)
We introduce a quantity similar to α_t(st):
vit_t(st) = max_{s0,...,st−1} P(x0, ..., xt, s0, ..., st−1, st)
The following recursions hold
vit_t(st) = P(xt | st) max_{s∈S} A_{s st} vit_{t−1}(s)
bp_t(st) = argmax_{s∈S} A_{s st} vit_{t−1}(s)
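A minimal numpy sketch of these recursions with an explicit back-trace, again assuming the `pi`, `A`, `B` conventions used in the forward sketch (and, as before, ignoring the log domain for clarity).

```python
# A sketch of the Viterbi algorithm with back-pointers.
import numpy as np

def viterbi(pi, A, B):
    T, S = B.shape
    vit = np.zeros((T, S))
    bp = np.zeros((T, S), dtype=int)
    vit[0] = pi * B[0]
    for t in range(1, T):
        scores = vit[t - 1][:, None] * A        # scores[r, s] = vit_{t-1}(r) A[r, s]
        bp[t] = scores.argmax(axis=0)           # back-pointer: best previous state
        vit[t] = B[t] * scores.max(axis=0)      # vit_t(s) = P(x_t|s) max_r scores[r, s]
    path = [int(vit[-1].argmax())]              # trace back the best state sequence
    for t in range(T - 1, 0, -1):
        path.append(int(bp[t, path[-1]]))
    return path[::-1], vit[-1].max()
```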
In ASR HMMs are usually used to model phones
We want the hidden states to model a portion of the speech that is consistently realized
- Usually across a large vocabulary
- Usually across many speakers
Instead of using phones, we actually use phones in context, often triphones
- The realization of a phone depends on context
- Start with a dictionary entry: cat → k ae t
- The corresponding triphone sequence is: sil-k+ae k-ae+t ae-t+sil
- Longer contexts, e.g., pentaphones, are useful too
- In principle one HMM for each triphone
The three HMM states: start, middle, end of phone
- Regions of local stationarity
- Phones in context
The triphones need to be clustered
Problem: a typical language uses 50 phonemes, so 125K possible triphones
- Triphone distribution, like words, is approximately Zipfian
  - Small population of frequent triphones
  - Long tail of rare or non-occurring triphones
- Even in huge training sets, many triphones will never occur
- What models should we use for unobserved triphones?
  - They may occur during recognition!
Solution: top-down decision tree clustering is performed on the triphone HMM states
- Each triphone state is in exactly one cluster
- Decision trees ask a series of questions, e.g.,
  - Is the left context a vowel?
  - Is the right context a stop?
What we need for ASR
A vocabulary
- The list of words that can be recognized
A dictionary
- A pronunciation for each word in terms of phonemes
AM and LM
- AM: HMMs for each triphone
- LM: models word sequences in the vocabulary
  - Often estimated from 100's of billions of words
The next slide shows the generative model
- Exploited in recognition and training
The standard ASR formalism
[Diagram: the generative model; words w1, w2, w3 generate phones p1, p2, ..., which generate HMM states s1, ..., s6, which emit observations o1, ..., o6, each a 39-dimensional vector (c1, c2, ..., c39)]
Notation: o = x (observations)
Remarks on recognition/decoding
The two main approaches to decoding differ in how their search space is constructed
- Dynamic decoders
- Static WFST decoders
- Both use the Viterbi algorithm to conduct their search
Dynamic decoders follow the previous slide very closely
- Expert knowledge/experimentation has been applied to optimize the search algorithms
- Much has been published but there are many unpublished secrets!
Remarks on recognition/decoding (cont'd)
WFST-based decoders use WFSTs to represent the search space
- Standard WFST theory/operations are used to algorithmically optimize the search space
- The actual decode on this space is vanilla Viterbi
Acoustic model training overview: requirements
A large acoustic training corpus
- Many thousands of hours of audio data
  - Drawn from a large speaker population
  - Similar to the target application
- Corresponding word-level transcripts
A dictionary that covers all the words in the training corpus
The standard ASR formalism: training
[Diagram: the same generative model as above, now used for training]
Notation: o = x (observations)
Acoustic model training overview: basic recipe
Train monophone HMMs
- Use transcripts and dictionary to get the monophone sequences: one HMM per monophone
Initialize triphone HMMs
- Use the context in the phone sequences to construct corresponding triphone sequences
  - "a b c" maps to "a+b a-b+c b-c"
- Clone triphone models
  - HMM parameters for a-b+c come from b
- Re-estimate triphone parameters
Cluster HMM triphone states using decision trees
Re-estimate clustered HMMs
Output distributions
While there are many potential choices for P(xt | st)
- Mixture models with normal components have been dominant until recently
- Often called "Gaussian" mixture models (GMMs)
  - Gives Gauss (his usage in 1810) undeserved credit
  - De Moivre was first, in 1733
  - Even Laplace was earlier than Gauss (in the 1770's)
Neural networks are currently the preferred choice
- The hybrid HMM/NNET framework was developed in the late 80's/early 90's by Morgan and Bourlard (with many other key contributors)
- A later lecture will cover HMM/NNETs
- HMM/GMMs are still very useful/instructive
Diagonal, multivariate normal distributions: review
A D-dimensional normal with diagonal covariance
- Parameters: mean vector µ and variance vector σ²
- Write µ_d for the d-th component
- N(µ, σ²) denotes the distribution
- ϕ(x; µ, σ²) denotes the probability density function (p.d.f.)
  ϕ(x; µ, σ²) = (1 / √((2π)^D ∏_{d=1}^{D} σ_d²)) exp( −(1/2) Σ_{d=1}^{D} (x_d − µ_d)² / σ_d² )
Mixture models of normals, i.e., GMMs
- M component weights, means, and variances {w_m, µ_m, σ_m²}_{m=1}^{M}
- Σ_{m=1}^{M} w_m = 1
- P.d.f.: Σ_{m=1}^{M} w_m ϕ(x; µ_m, σ_m²)
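As a quick illustration, here is a numpy sketch of the diagonal-covariance log density and the corresponding GMM density; the array shapes and names are assumptions, not part of the slides.

```python
# A sketch of the diagonal-covariance normal log-density and a GMM density,
# assuming mu and var are (D,) vectors and, for the GMM, w is (M,),
# mus is (M, D), and vars_ is (M, D).
import numpy as np

def log_normal_diag(x, mu, var):
    # log phi(x; mu, sigma^2) for a diagonal covariance
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def gmm_density(x, w, mus, vars_):
    # sum_m w_m phi(x; mu_m, sigma_m^2)
    log_comps = np.array([log_normal_diag(x, mus[m], vars_[m]) for m in range(len(w))])
    return float(np.sum(w * np.exp(log_comps)))
```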
Multivariate normal distributions: review
Maximum likelihood estimates (MLEs)
- A sample of T D-dimensional training examples {x_t}_{t=1}^{T}
- x_{t,d} is example t's d-th component
- MLE for the mean:
  µ̂ = (1/T) Σ_{t=1}^{T} x_t
- MLE for the variance in dimension d:
  σ̂_d² = (1/T) Σ_{t=1}^{T} (x_{t,d} − µ̂_d)²
We are skipping MLEs for GMMs
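These MLEs are one-liners in numpy; the sketch below assumes the sample is stored as a (T, D) array `X` (an assumed layout, not from the slides).

```python
# MLEs for the mean and per-dimension variance of a diagonal normal.
import numpy as np

def normal_mle(X):
    mu_hat = X.mean(axis=0)                      # average over the T examples
    var_hat = ((X - mu_hat) ** 2).mean(axis=0)   # per-dimension variance (divide by T)
    return mu_hat, var_hat
```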
HMMs with normal output distributions
For each state s ∈ S
- P(x | s) = ϕ(x; µ_s, σ_s²)
- State-dependent means and variances {µ_s, σ_s²}_{s=1}^{S}
Given a training sample {x_t}_{t=1}^{T}, if we knew which state s_t was responsible for each x_t, then the MLEs would be obvious
- But we don't, since the state identities are hidden
Two approximate solutions to this problem
- Viterbi training
- Baum-Welch (BW) training
HMMs with normal output distributions
Let θ denote the HMM model parameters
- The state transition matrix A
- State means and variances {µ_s, σ_s²}_{s=1}^{S}
- Surface the existence of θ in the probability distributions: P_θ
The Baum-Welch and Viterbi estimation algorithms
- Are iterative
- Produce sequences of model parameter estimates
  - θ̂_0, θ̂_1, ..., θ̂_n
We are skipping the estimates for transitions and mixture models
- Functionally equivalent and easy to work out
Estimate θ̂_{i+1} using θ̂_i
Viterbi training
- Uses the maximum likelihood estimate for the state identities
- Uses the Viterbi algorithm and P_{θ̂_i}
- A hard, single choice for which state generates each frame
- Kaldi uses Viterbi training
Baum-Welch training
- Uses the distributions P_{θ̂_i}(s_t | x_1, ..., x_T)
- Uses the forward-backward algorithm
- A soft, fractional choice of which state generates each frame
- HTK uses Baum-Welch training
Viterbi training: basic idea
First use the Viterbi algorithm to obtain
ŝ = argmax_{s_1,...,s_T} P_{θ̂_i}(x_1, ..., x_T, s_1, ..., s_T)
- ŝ depends on θ̂_i, so we should write ŝ_{θ̂_i}
  - We don't, to keep notation "cleaner"
- View this as an assignment of frames to states
  - I.e., x_t is generated by state ŝ_t
  - Called a Viterbi or forced state-level alignment
- This means that state k's training data is the set of frames
  {x_t : 1 ≤ t ≤ T and δ_{k,ŝ_t} = 1}
- Frame x_t's count for state k is δ_{k,ŝ_t} (depends on θ̂_i)
Viterbi training: the new estimate θ̂_{i+1}
We use the MLEs obtained on each state's training data to obtain the new parameter estimates θ̂_{i+1}
- MLE for state k's mean:
  µ̂_k = ( Σ_{t=1}^{T} δ_{k,ŝ_t} x_t ) / ( Σ_{t=1}^{T} δ_{k,ŝ_t} )
- MLE for state k's variance in dimension d:
  σ̂_{k,d}² = ( Σ_{t=1}^{T} δ_{k,ŝ_t} (x_{t,d} − µ̂_{k,d})² ) / ( Σ_{t=1}^{T} δ_{k,ŝ_t} )
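A sketch of this hard-count update, assuming the frames are a (T, D) array `X` and `align` is a length-T array holding the Viterbi state-level alignment ŝ; it also assumes every state receives at least one frame.

```python
# Viterbi-training update of the state means and variances from a hard alignment.
import numpy as np

def viterbi_update(X, align, num_states):
    means, variances = [], []
    for k in range(num_states):
        Xk = X[align == k]                       # frames assigned to state k
        mu_k = Xk.mean(axis=0)                   # assumes state k has >= 1 frame
        means.append(mu_k)
        variances.append(((Xk - mu_k) ** 2).mean(axis=0))
    return np.array(means), np.array(variances)
```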
Baum-Welch training: basic idea
First use the forward-backward algorithm to obtain
P_{θ̂_i}(s_t | x_1, ..., x_T)
- View this as a probabilistic assignment of frames to states
  - I.e., x_t is generated by state k with probability P_{θ̂_i}(s_t = k | x_1, ..., x_T)
- Frame x_t's count for state k is P_{θ̂_i}(s_t = k | x_1, ..., x_T)
- The counts are now fractional instead of integral
Baum-Welch training: the new estimate θ̂_{i+1}
We use the MLEs obtained on each state's training data to obtain the new parameter estimates θ̂_{i+1}
- MLE for state k's mean:
  µ̂_k = ( Σ_{t=1}^{T} P_{θ̂_i}(s_t = k | x_1, ..., x_T) x_t ) / ( Σ_{t=1}^{T} P_{θ̂_i}(s_t = k | x_1, ..., x_T) )
- MLE for state k's variance in dimension d:
  σ̂_{k,d}² = ( Σ_{t=1}^{T} P_{θ̂_i}(s_t = k | x_1, ..., x_T) (x_{t,d} − µ̂_{k,d})² ) / ( Σ_{t=1}^{T} P_{θ̂_i}(s_t = k | x_1, ..., x_T) )
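A sketch of the corresponding soft-count update, assuming `X` is a (T, D) array of frames and `gamma` is the (T, S) matrix of state posteriors from the forward-backward sketch above.

```python
# Baum-Welch update of the state means and variances from fractional counts.
import numpy as np

def baum_welch_update(X, gamma):
    counts = gamma.sum(axis=0)                          # soft counts per state, shape (S,)
    means = (gamma.T @ X) / counts[:, None]             # weighted means, shape (S, D)
    sq_dev = (X[:, None, :] - means[None, :, :]) ** 2   # (T, S, D) squared deviations
    variances = np.einsum('ts,tsd->sd', gamma, sq_dev) / counts[:, None]
    return means, variances
```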
How do we initialize, i.e., choose θ_0?
Usually via a flat start
- State transition probabilities use the uniform distribution
  - E.g., A_{1,1} = A_{1,2} = 0.5
- Use the training data's total mean and variance for all the states' means and variances:
  µ̄ = (1/T) Σ_{t=1}^{T} x_t
  σ̄_d² = (1/T) Σ_{t=1}^{T} (x_{t,d} − µ̄_d)² ∀d
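A sketch of a flat start under these conventions; the simple left-to-right topology (self-loop or advance, each with probability 0.5, for every emitting state) is an assumption for illustration, and `X` is again a (T, D) array of all training frames.

```python
# Flat-start initialization: global mean/variance for every state, uniform transitions.
import numpy as np

def flat_start(X, num_states):
    mu_bar = X.mean(axis=0)
    var_bar = ((X - mu_bar) ** 2).mean(axis=0)
    means = np.tile(mu_bar, (num_states, 1))         # same mean for every state
    variances = np.tile(var_bar, (num_states, 1))    # same variance for every state
    A = np.zeros((num_states, num_states))
    for s in range(num_states - 1):
        A[s, s] = A[s, s + 1] = 0.5                  # e.g. A[1,1] = A[1,2] = 0.5
    A[-1, -1] = 1.0                                  # last state absorbs
    return means, variances, A
```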
What are Viterbi/Baum-Welch training optimizing?
Baum-Welch training
- The Baum-Welch model/parameter selection criterion is the training data likelihood:
  L(θ) = Σ_{s_1,...,s_T} P_θ(x_1, ..., x_T, s_1, ..., s_T)
- Produces a parameter sequence (θ̂_0, θ̂_1, ..., θ̂_n) satisfying
  L(θ̂_0) ≤ L(θ̂_1) ≤ · · · ≤ L(θ̂_n)
- Baum-Welch is an example of an Expectation-Maximization algorithm
  - Many interesting/important properties
  - Beyond the scope of this introduction
What are Viterbi/Baum-Welch training optimizing?
Viterbi training
- The Viterbi model/parameter selection criterion is an approximation to the training data likelihood:
  VL(θ) = max_{s_1,...,s_T} P_θ(x_1, ..., x_T, s_1, ..., s_T)
- Produces a parameter sequence (θ̂_0, θ̂_1, ..., θ̂_n) satisfying
  VL(θ̂_0) ≤ VL(θ̂_1) ≤ · · · ≤ VL(θ̂_n)
- This is straightforward to verify: exercise!
A short, incomplete list of things we didn't cover
Discriminative training
- Alternatives to MLE using different model selection criteria
- E.g., MMI and MPE
Model adaptation
- E.g., given a small sample of data from a new speaker: can we leverage extant models to more accurately predict future samples?
- Bayesian methods are ideal
- MAP, MLLR, etc.
Reducing inter-speaker variability at training/test time
- Vocal tract length normalization (VTLN)
- Speaker adaptive training (SAT)