CS 552/652
Speech Recognition with Hidden Markov Models
Winter 2011
Oregon Health & Science University
Center for Spoken Language Understanding
John-Paul Hosom
Lecture 17½
Speaker Adaptation
Notes based on:
Huang, Acero, and Hon (2001), “Spoken Language Processing” section 9.6
Lee and Gauvain (1993), “Speaker Adaptation Based on MAP Estimation of HMM Parameters”, ICASSP 93
Woodland (2001), “Speaker Adaptation for Continuous Density HMMs: A Review”
Gauvain and Lee (1994), “Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains”
Renals (2008), speaker adaptation lecture notes
Lee and Rose (1996), “Speaker Normalization Using Efficient Frequency Warping Procedures”
Panchapagesan and Alwan (2008), “Frequency Warping for VTLN and Speaker Adaptation by Linear Transformation of Standard MFCC”
Speaker Adaptation
Given an HMM that has been trained on a large number of people
(a speaker-independent HMM), we can try to improve
performance by adapting to the speaker currently being
recognized in testing.
Two basic types of speaker adaptation:
1. Adaptation of the feature space (speaker normalization)
Vocal-Tract Length Normalization (VTLN)
= warp the feature space to better fit the model parameters
2. Adaptation of the model parameters
Maximum A Posteriori (MAP) adaptation
= retrain individual state parameters
Maximum Likelihood Linear Regression (MLLR)
= “warp” model parameters to better fit adaptation data
Speaker Normalization
Common technique is Vocal Tract Length Normalization (VTLN)
Assumption: The majority of speaker differences in the acoustic
space are caused by different vocal tract lengths. Different
lengths of the vocal tract can be normalized using a non-linear
frequency warping (like Mel scale, but on speaker-by-speaker
basis).
Performance using VTLN typically improves by a relative
reduction in error of 10% (e.g. from 22% WER to 20% WER,
or 10% to 9%, or 5% to 4.5%).
Two questions need to be answered to implement VTLN:
1. What type of non-linear warping should be used?
2. How do we determine the optimal parameter value for the non-linear warping during both training and recognition?
Speaker Normalization
With different lengths of vocal tract, the resonant frequencies
(formants) shift. A shorter vocal tract yields higher formants; a
longer vocal tract yields lower formants. But the shift is not a
linear function of frequency. So, need to choose a non-linear
warping function.
1. what type of non-linear warping?
piecewise linear
adjustment of Mel scale
power function
Also, what range for parameters? If we consider vocal tract
lengths to be correlated with a person’s height, then we can look
at variation in height to determine range of vocal tract lengths. In
the U.S., the average male is 5’10” (1.778 m) and has a VTL of 17 cm. A
tall man might be 6’6”, or 11% taller than average. An average
woman is 5’ 4”, or 90% of the average male height. A short
woman might be 85% of the average male height.
Speaker Normalization
warping of Mel scale:

$\text{Mel}(f) = 2595\, \log_{10}\!\left(1 + \frac{\alpha f}{700}\right)$

[Figure: frequencies for warping with α from 0.85 to 1.10 in steps of 0.05, compared with no warping (α = 1.0).]
(equation from Huang, Acero, Hon “Spoken Language Processing” 2001 p. 427)
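As a concrete illustration (added here, not from the original slide), a minimal Python sketch of the warped Mel mapping above; the function name is mine:

    import numpy as np

    def warped_mel(f_hz, alpha=1.0):
        # Mel value of frequency f_hz after warping the frequency axis
        # by factor alpha (alpha = 1.0 gives the standard Mel scale).
        return 2595.0 * np.log10(1.0 + (alpha * f_hz) / 700.0)

    # Larger alpha (shorter vocal tract) maps a given frequency higher:
    for alpha in (0.85, 1.0, 1.10):
        print(alpha, warped_mel(1000.0, alpha))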
Speaker Normalization
piecewise linear warping:
(figure from Renals’ ASR lecture, 2008)
Speaker Normalization
warping by power function: $\hat{f} = \alpha\, \sqrt[3]{f / 8000}\; f$
(figure from Renals’ ASR lecture, 2008)
[Figure: formant frequencies for vocal tract lengths of 85%, 90%, 95%, 100%, 105%, and 110% of 17 cm.]
Speaker Normalization
Actual estimated warping for different vocal tract lengths, based
on two-tube model of four vowels (/ax/, /iy/, /ae/, /aa/; tube
parameter values taken from CS551 Lecture 9):
[Figure: estimated warping functions for vocal tract lengths of 85% (14.4 cm), 100% (17 cm), and 110% (18.7 cm), plotted against formant frequencies for a 17-cm vocal tract.]
So, the complexity of non-linear warping actually isn’t warranted; a linear model, or α-warping of the Mel scale, fits the theoretical data well.
Speaker Normalization
2. how to determine optimal parameter value during both training
and recognition?
• “Grid Search”: try 13 regularly-spaced values of α from 0.88 to 1.12, and find the value that maximizes the likelihood of the model; this gives a linear increase in processing time (Lee and Rose, 1996). (A sketch follows this list.)
• Use gradient search instead of grid search.
• Estimate and align (along the frequency scale) formant peaks in the speaker’s data. For example, use the ratio of the median position of the 3rd formant for the current speaker to the median F3 averaged over all speakers (Eide and Gish, 1996):

$\alpha_{speaker} = \frac{\mathrm{median}(F3_{speaker})}{\mathrm{median}(F3_{all})}$
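A minimal Python sketch of the grid-search method (an added illustration; log_likelihood stands in for scoring the utterance’s α-warped features against the S.I. model, e.g. with a forward pass):

    import numpy as np

    def best_warp_factor(utterance, model, log_likelihood):
        # Grid search as in Lee and Rose (1996): 13 regularly-spaced
        # candidate values of alpha from 0.88 to 1.12 (step 0.02).
        alphas = np.linspace(0.88, 1.12, 13)
        # One scoring pass per candidate, hence the linear increase
        # in processing time noted above.
        scores = [log_likelihood(utterance, model, a) for a in alphas]
        return alphas[int(np.argmax(scores))]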
Maximum a Posteriori (MAP) Adaptation of Model Parameters
If we have some (labeled) data for a speaker, we can adapt our
model parameters to better fit that speaker using MAP adaptation.
Sometimes just the means are updated; the covariance matrices are assumed to stay the same, as are the transition probabilities and mixture weights. We also assume that each aspect (means, covariance matrices, etc.) can be treated independently.
Maximum Likelihood estimation:

$\lambda_{ML} = \arg\max_{\lambda} f(O \mid \lambda)$

MAP estimation:

$\lambda_{MAP} = \arg\max_{\lambda} f(O \mid \lambda)\, g(\lambda)$
where g() is the prior probability distribution of the model over
the space of model parameter values. (If we know nothing about
g(), the prior probability of the model, then MAP reduces to ML
estimation.)
parameter space 
original paper on MAP:
Lee and Gauvain, ICASSP 1993
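To make the ML/MAP contrast concrete (an added worked example, not from the original slides): for a single Gaussian with known variance and a conjugate Gaussian prior on its mean, the MAP estimate has a closed form that interpolates between the prior mean $\mu_0$ and the sample mean:

$\hat{\mu}_{MAP} = \frac{\tau \mu_0 + \sum_{t=1}^{T} o_t}{\tau + T}$

where $\tau$ reflects the strength of the prior. As $T \rightarrow \infty$, the data term dominates and the estimate approaches the ML estimate (the sample mean); this is exactly the shape of the mean-update rule derived on the following slides.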
Maximum a Posteriori (MAP) Adaptation of Model Parameters
What do we know about g(), the prior probability density function
of the new model? Usually, we don’t know g(), so we use
maximum-likelihood (EM) training. However, in this case, we
have an existing, speaker-independent (S.I.) model (know, prior
information) and we want to learn the model for a specific speaker.
If we assume that each of the parts of the GMM model (, ,
weights) are independent, we can optimize each of these subproblems independently.
For the D-dimensional Gaussian distributions characterized by 
and , the prior density g() can be represented with a normalWishart density, with the following parameters: >D-1, >0.
The normal Wishart pdf also has a vector nw being the mean of the
Gaussian of the speaker-independent model, and a matrix S being
the covariance matrix from the speaker-independent model.
Maximum a Posteriori (MAP) Adaptation of Model Parameters
Applying a Lagrange-multiplier derivation, similar to the EM derivation (Lecture 12), to this normal-Wishart pdf, the update formula for the means of the model λ becomes:

$\hat{\mu}_{ik} = \frac{\tau_{ik}\, \mu_{nw_{ik}} + \sum_{t=1}^{T} \gamma_t(i,k)\, o_t}{\tau_{ik} + \sum_{t=1}^{T} \gamma_t(i,k)}$

where the o_t are the observations for the new speaker, the γ_t(i,k) are probabilities computed for the new speaker, and τ_ik comes from the S.I. model.
$\mu_{nw_{ik}}$ is the mean of the S.I. model for state i, component k.
• τ_ik, which weights the contribution of prior knowledge (the S.I. model) against the new observed data (the speaker-dependent data), is determined empirically. This controls the rate of change of $\hat{\mu}_{ik}$.
• γ_t(i,k) is the probability of being in state i and component k at time t, given the speaker-dependent data and model (Lecture 11).
This updating of the means is iterated, just like EM. Each iteration changes the γ_t(i,k) values, and therefore the $\hat{\mu}_{ik}$.
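A minimal Python sketch of one pass of this mean update (an added illustration; gamma is assumed to be the length-T vector of occupation probabilities γ_t(i,k) for a fixed state i and component k, computed by forward-backward on the adaptation data):

    import numpy as np

    def map_update_mean(mu_si, obs, gamma, tau=10.0):
        # mu_si : (D,)   speaker-independent (prior) mean
        # obs   : (T, D) adaptation observations o_t for the new speaker
        # gamma : (T,)   occupation probabilities gamma_t(i,k)
        # tau   : prior weight; larger tau keeps the estimate closer
        #         to the S.I. mean
        numerator = tau * mu_si + gamma @ obs   # tau*mu + sum_t gamma_t*o_t
        denominator = tau + gamma.sum()         # tau + sum_t gamma_t
        return numerator / denominator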
Maximum a Posteriori (MAP) Adaptation of Model Parameters
“When ik is large, the prior density is sharply peaked around the
values of the seed (S.I.) HMM parameters which will be only
slightly modified by the adaptation process. Conversely, if ik is
small, the adaptation will be very fast” (Lee and Gauvain (1993), p. 560).
When this weight is small, the effect of the S.I. model is smaller,
and the speaker-specific observations dominate the computation.
As the number of observations of the new speaker increases for
state j and component k (or, as T approaches infinity), the MAP
estimate approaches the ML estimate of the new data, as the new
data dominate over the old mean μ nwik.
The same approach can be used to adjust the covariance matrix.
ik can be constrained to be the same for all components in all
GMMS and states; a typical value is between 2 and 20.
Maximum a Posteriori (MAP) Adaptation of Model Parameters
“MAP HMM can be regarded as an interpolated model between
the speaker-independent and speaker-dependent HMM. Both are
derived from the standard ML forward-backward algorithm.”
(Huang, p. 447)
How much data is needed? Of course, more is better. Results have been reported for anywhere from only a few utterances per new speaker up to 600 utterances per new speaker.
Problem 1: we need a (relatively) large amount of adaptation data for the speaker being adapted to.
Problem 2: each state and component is updated independently. If a speaker produces no data associated with a particular state and component, then that state still uses the S.I. model. It would be nice to update all the parameters of the model from a small amount of data.
Maximum Likelihood Linear Regression (MLLR)
• The idea behind MLLR is to use a set of linear regression
transformation functions to map means (and maybe also
covariances) in order to maximize the likelihood on the
adaptation data.
• In other words, we want to find some linear transform (of the form ax + b) that warps the mean vector in such a way that the likelihood of the model given the new data, $L(\lambda \mid O_{new})$, is maximized. (In the following, o_t is one frame of O_new.)
• Updating only the means is effective; updating the covariance
matrix gives less than an additional 2% error reduction (Huang,
p. 450) and so is less commonly done.
• The same transformation can be used for similar GMMs; this sharing allows the entire model to be updated faster and more uniformly.
Maximum Likelihood Linear Regression (MLLR)
The mean vector μ ik for state i, component k can be transformed
using the following equation:
$\tilde{\mu}_{ik} = A_c\, \mu_{ik} + b_c$

where A_c is a regression matrix and b_c is an additive bias vector; A_c and b_c are associated with a broad class c of phonemes or a set of tied states (not just an individual state), to better share model parameters.
We want to find A_c and b_c such that the mismatch with the new (speaker-specific) data is smallest. We can re-write the transform as

$\tilde{\mu}_{ik} = W_c\, \xi_{ik}$

where $\xi_{ik} = \begin{bmatrix} 1 \\ \mu_{ik} \end{bmatrix}$ is the extended mean vector, and we need to solve for W_c, which contains both A_c and b_c: $W_c = [b_c, A_c]$.
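In code, the transform is a single matrix-vector product on the extended mean vector (an added sketch, with my own function name):

    import numpy as np

    def mllr_transform_mean(mu, W_c):
        # W_c is the D x (D+1) matrix [b_c, A_c]; mu is the (D,) mean.
        xi = np.concatenate(([1.0], mu))   # extended vector [1, mu]^T
        return W_c @ xi                    # equals A_c @ mu + b_c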
Maximum Likelihood Linear Regression (MLLR)
Maximizing a Q function by setting the derivative to zero, in the
same way that was done in Lecture 12, maximizes the likelihood
of the adaptation data (Huang p. 448-449); this yields the function
$\sum_{t=1}^{T} \sum_{(i,k) \in c} \gamma_t(i,k)\, \Sigma_{ik}^{-1}\, o_t\, \xi_{ik}^{T} \;=\; \sum_{t=1}^{T} \sum_{(i,k) \in c} \gamma_t(i,k)\, \Sigma_{ik}^{-1}\, W_c\, \xi_{ik}\, \xi_{ik}^{T}$

which can be re-written as

$Z = \sum_{(i,k) \in c} V_{ik}\, W_c\, D_{ik}$

where

$Z = \sum_{t=1}^{T} \sum_{(i,k) \in c} \gamma_t(i,k)\, \Sigma_{ik}^{-1}\, o_t\, \xi_{ik}^{T}$

$V_{ik} = \sum_{t=1}^{T} \gamma_t(i,k)\, \Sigma_{ik}^{-1}$

$D_{ik} = \xi_{ik}\, \xi_{ik}^{T}$
Maximum Likelihood Linear Regression (MLLR)
If the covariance matrix Σ_ik is diagonal, there is a closed-form solution for W_c, row by row:

$W_q = Z_q\, G_q^{-1}$

where the subscript q denotes the qth row of the matrices W_c and Z, and

$G_q = \sum_{(i,k) \in c} v_{qq}^{(ik)}\, D_{ik}$

where $v_{qq}^{(ik)}$ denotes the qth diagonal element of V_ik.
We need to make sure that Gq is invertible, by having enough
training data. If there’s not enough data, we can tie more classes
together.
This process can be iterated with new values for t(i,k) and ik in
each iteration, but usually one iteration gives the most gain in
performance.
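A minimal Python sketch of the row-by-row solution (an added illustration; it assumes the per-component statistics V_ik and D_ik, and the matrix Z, have already been accumulated from the adaptation data as defined above):

    import numpy as np

    def solve_mllr_rows(Z, V_list, D_list):
        # Z      : (D, D+1) accumulated left-hand-side statistic
        # V_list : list of (D, D) matrices V_ik (diagonal covariances)
        # D_list : list of (D+1, D+1) matrices D_ik = xi xi^T
        D = Z.shape[0]
        W = np.zeros_like(Z)
        for q in range(D):
            # G_q = sum over components (i,k) in class c of v_qq * D_ik
            G_q = sum(V[q, q] * D_ik for V, D_ik in zip(V_list, D_list))
            W[q] = Z[q] @ np.linalg.inv(G_q)   # W_q = Z_q G_q^{-1}
        return W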
Maximum Likelihood Linear Regression (MLLR)
Unsupervised adaptation can be done by (a) recognizing with a
speaker-independent (S.I.) model, and then (b) assuming that
these recognized results are correct, using these results as training
data for adaptation. (In this case, the use of confidence scores
(indicating which regions of speech are better recognized) may be
helpful to constrain the training to adapt only to correctly-recognized speech samples.)
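A sketch of this unsupervised loop (an added illustration under assumed interfaces: recognize returns a transcript and a confidence score, and adapt_mllr performs the MLLR estimation of the previous slides):

    def unsupervised_adapt(utterances, si_model, recognize, adapt_mllr,
                           min_confidence=0.9):
        # (a) Recognize with the S.I. model; (b) treat confident results
        # as correct labels and use them as adaptation data.
        adaptation_data = []
        for utt in utterances:
            transcript, confidence = recognize(si_model, utt)
            if confidence >= min_confidence:   # keep well-recognized speech
                adaptation_data.append((utt, transcript))
        return adapt_mllr(si_model, adaptation_data)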
MLLR and MAP can be combined for (slightly) better
performance over either technique alone. Also, MLLR and VTLN
performance improvements are often approximately additive. For example, a 10% relative WER reduction from VTLN and a 15% relative WER reduction from MLLR in isolation yield roughly a 25% relative WER reduction from using both VTLN and MLLR (Pye and Woodland, ICASSP 97).
Maximum Likelihood Linear Regression (MLLR)
One example of combining MAP and MLLR is from the Whisper system: a 15% relative WER reduction using MLLR alone, and a total 22% relative error reduction on 1000 utterances from combined MAP+MLLR. (The speaker-dependent system was trained on the 1000 utterances from that speaker.)