T.J. Watson Research Center, Human Language Technologies
Improvements to fMPE
Dan Povey
• Review of fMPE
• Mean offsets as features
• Multiple layer framework
• Context expansion in multiple layer framework
• Improved way of setting learning rate
• Improved way of setting per-dimension scales on learning rate
• “Smooth update” – more stable update rule
• “Out of the box training”
• Diagnostics
• Other issues
• What is most important?
EARS progress update
Review of fMPE (1 of 3, overview)
• In fMPE, we train a nonlinear offset to the features:
    y_t = x_t + M h_t
• h_t is a high-dimensional vector and a function of x_t (and possibly of context frames x_{t-1}, x_{t+1}, etc.).
• The transformation parameters M are trained using the MPE objective function, using a modified form of gradient descent.
Review of fMPE (2 of 3, features)
• The high-dimensional features h_t are (in the original implementation) a vector of Gaussian posteriors with frame splicing:
  - Obtain 100,000 Gaussians by clustering the HMM set.
  - Calculate Gaussian posteriors (model-free) on each frame.
  - Splice vectors on adjacent frames together to create a larger vector (actually, splice together frames and averages of frames for a larger context window).
• The vector h_t is very sparse (even though M is not), so calculations are fast.
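A minimal sketch of the single-frame computation described above, under stated assumptions: the function names, the `top_k` pruning used to make h_t sparse, and the toy dimensions are all illustrative and not taken from the original implementation. Frame splicing is left out to keep the example short.

```python
import numpy as np

def gaussian_posteriors(x, means, variances, weights, top_k=2):
    """Posteriors of diagonal-covariance Gaussians for one frame x.

    Pruning to the top_k posteriors is an assumption made here to get a
    sparse vector; the slides do not say how sparsity is obtained.
    """
    diff = x - means                                   # (G, d)
    loglik = np.log(weights) - 0.5 * np.sum(
        diff ** 2 / variances + np.log(2.0 * np.pi * variances), axis=1)
    post = np.exp(loglik - loglik.max())               # stable exponentiation
    post /= post.sum()
    keep = np.argsort(post)[-top_k:]                   # indices of largest posteriors
    sparse = np.zeros_like(post)
    sparse[keep] = post[keep]
    return sparse / sparse.sum()                       # renormalize after pruning

def fmpe_offset(x, h, M):
    """y_t = x_t + M h_t, touching only the columns of M where h_t is nonzero."""
    nz = np.flatnonzero(h)
    return x + M[:, nz] @ h[nz]

rng = np.random.default_rng(0)
d, G = 3, 8                                            # toy sizes (hypothetical)
means = rng.normal(size=(G, d))
variances = np.ones((G, d))
weights = np.full(G, 1.0 / G)
M = 0.1 * rng.normal(size=(d, G))
x = rng.normal(size=d)
h = gaussian_posteriors(x, means, variances, weights)
y = fmpe_offset(x, h, M)
```

Because only the nonzero entries of h_t select columns of M, the cost per frame scales with the number of surviving posteriors rather than with the full dimension of h_t.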
Review of fMPE (3 of 3, training)
• A specific learning rate for each parameter M_ij is obtained by accumulating the positive and negative contributions to ∂F/∂M_ij and dividing by the sum of the absolute values of both.
• Compensate for different dimensions of the feature vector having different average variance.
• The differential w.r.t. matrix element M_ij contains an “indirect” term reflecting changes that will happen in the means and variances when we re-train the system. This is necessary because the HMM parameters are trained with ML while the matrix is trained with MPE.
  - Features affect means & vars,
  - means & vars affect the objective function
  ⇒ differentiate back through the process.
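The per-parameter learning rates can be sketched as follows. This is one reading of the bullets above, assuming the form ν_ij = σ_i / (E (p_ij + n_ij)), where p_ij and n_ij are the accumulated positive and negative contributions, σ_i is the per-dimension scale and E is the global constant. The function names, the `eps` guard and the random inputs are illustrative; `dF_dy` stands in for the total (direct plus indirect) differential with respect to the transformed features.

```python
import numpy as np

def accumulate_pos_neg(dF_dy, H):
    """Split per-frame contributions to dF/dM_ij into positive and negative sums.

    dF_dy: (T, d) differential w.r.t. the transformed features y_t
    H:     (T, D) high-dimensional features h_t
    """
    contrib = dF_dy[:, :, None] * H[:, None, :]        # (T, d, D) per-frame terms
    p = np.maximum(contrib, 0.0).sum(axis=0)           # sum of positive parts
    n = np.maximum(-contrib, 0.0).sum(axis=0)          # sum of |negative parts|
    return p, n

def update_matrix(M, p, n, sigma, E, eps=1e-20):
    """Gradient step with nu_ij = sigma_i / (E * (p_ij + n_ij)) (assumed form)."""
    nu = sigma[:, None] / (E * (p + n + eps))
    return M + nu * (p - n)                            # p - n is dF/dM_ij

rng = np.random.default_rng(0)
T, d, D = 50, 3, 6                                     # toy sizes (hypothetical)
dF_dy = rng.normal(size=(T, d))
H = rng.random(size=(T, D))
sigma = np.array([1.0, 2.0, 0.5])
p, n = accumulate_pos_neg(dF_dy, H)
M0 = np.zeros((d, D))
M1 = update_matrix(M0, p, n, sigma, E=2.0)
```

Under this form no element can move by more than σ_i / E in one iteration, since |p_ij - n_ij| ≤ p_ij + n_ij.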
Mean offsets as features
• Probably the most important change (results already given in the last EARS meeting):
• Using far fewer Gaussians (e.g. 1000 instead of 100,000) and adding the offsets of the observed features from the mean.
• If the posteriors were [γ_1, γ_2, …], we are now using:
    [ 5.0 γ_1, γ_1 (x_t(1) - μ_1(1))/σ_1(1), γ_1 (x_t(2) - μ_1(2))/σ_1(2), …,
      5.0 γ_2, γ_2 (x_t(1) - μ_2(1))/σ_2(1), γ_2 (x_t(2) - μ_2(2))/σ_2(2), … ]
• Each posterior is followed by the offsets of the feature from the mean.
• Divide by σ_n to ensure equal scales on all offsets.
• 5.0 is a scale to put more weight on the posterior itself.
• For 1000 Gaussians, the final dimension of the feature h_t would be 1000 (d+1) for d-dimensional features (ignoring frame splicing).
• Improves both accuracy and speed.
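The layout of the offset features can be illustrated with a short sketch. The function name and the toy values are assumptions; the block structure (scaled posterior, then posterior-weighted normalized offsets) follows the bullets above.

```python
import numpy as np

def offset_features(x, post, means, stddevs, post_scale=5.0):
    """Build the single-frame feature vector of size G * (d + 1).

    x:       (d,)   observed feature vector
    post:    (G,)   Gaussian posteriors for this frame
    means:   (G, d) Gaussian means
    stddevs: (G, d) Gaussian standard deviations
    """
    offsets = post[:, None] * (x - means) / stddevs            # (G, d)
    blocks = np.concatenate([post_scale * post[:, None], offsets], axis=1)
    return blocks.reshape(-1)                                  # one block per Gaussian

rng = np.random.default_rng(0)
G, d = 3, 2                                                    # toy sizes (hypothetical)
means = rng.normal(size=(G, d))
stddevs = np.ones((G, d))
post = np.array([0.7, 0.2, 0.1])
x = means[0].copy()                                            # frame sitting on the first mean
h = offset_features(x, post, means, stddevs)
```

With the frame placed exactly on the first mean, the offsets in the first block are zero and only the scaled posterior remains there.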
Multiple layer framework
• Motivation: using mean offsets combined with frame averaging and splicing reduces the sparsity of h_t to the point where training takes much longer.
• Need to reorganize the calculation into multiple stages.
• Developed a code framework where features can undergo multiple layers of processing and propagate differentials back to previous layers.
• Using multiple modules with a normalized interface (e.g. a layer doing a linear transformation would be called in the same way as a layer calculating Gaussian posteriors)*
• Makes it very easy to add new kinds of processing (just copy, rename and modify an existing module).
• Setup controlled by a config file.
*except that some features need to be stored sparsely
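As a rough illustration of what a normalized layer interface could look like, here is a sketch in which every module exposes the same `forward` and `backward` calls. The class and method names are hypothetical and are not taken from the actual framework; only a linear layer is shown.

```python
import numpy as np

class Layer:
    """Common interface: forward maps features, backward maps differentials."""
    def forward(self, x):
        raise NotImplementedError
    def backward(self, grad_out):
        raise NotImplementedError

class Linear(Layer):
    def __init__(self, M):
        self.M = M
        self.grad_M = np.zeros_like(M)          # accumulated dF/dM
    def forward(self, x):
        self.x = x                              # remember input for backward
        return self.M @ x
    def backward(self, grad_out):
        self.grad_M += np.outer(grad_out, self.x)
        return self.M.T @ grad_out              # differential w.r.t. the input

class Stack(Layer):
    """Chain of layers; differentials are propagated back in reverse order."""
    def __init__(self, layers):
        self.layers = layers
    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x
    def backward(self, grad_out):
        for layer in reversed(self.layers):
            grad_out = layer.backward(grad_out)
        return grad_out

rng = np.random.default_rng(0)
M1 = rng.normal(size=(4, 5))
M2 = rng.normal(size=(2, 4))
x = rng.normal(size=5)
net = Stack([Linear(M1), Linear(M2)])
y = net.forward(x)
g = np.ones(2)                                  # stand-in for dF/dy
gx = net.backward(g)
```

A new kind of processing would then be another subclass of `Layer` with its own `forward` and `backward`, which is in the spirit of the "copy, rename and modify" bullet.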
Context expansion in multiple layer framework (1 of 2)
• Previously, we would calculate h_t explicitly (including splicing) and then project. But with mean offsets and splicing it is not sparse enough.
• Now, calculate the “single-frame” h_t with no splicing (e.g. of size 1000 (d+1)) and project it to a multiple of d, e.g. 9d, then splice and project to d:
    h_t → M_1 h_t → M_2 (M_1 h_t, M_1 h_{t+1}, M_1 h_{t-1}, …)
    (dimension): 1000(d+1) → 9d
• Splice the 9d-dimensional feature across e.g. 80 frames and project down to d with a projection s.t. each output dimension only “sees” 1/d of the input dimensions. #parameters = 9 * 80 * d.
• Initialize the projection to be equivalent to the original context expansion, so the first of the 9 contexts gets projected from the central frame, the second gets projected only from one frame to the right, etc.
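The second-layer projection might be sketched as below. This assumes that "each output dimension only sees 1/d of the input dimensions" means output dimension i combines only component i of each context block on each spliced frame, which gives (contexts × frames × d) parameters. The function names, the edge padding at utterance boundaries, and the three-context example with offsets 0, +1, -1 are illustrative assumptions.

```python
import numpy as np

def init_weights(n_ctx, d, half_window, ctx_offsets):
    """Weights W[frame, context, dim], initialized so that context c is taken
    from a single frame offset ctx_offsets[c] (mimicking plain splicing)."""
    n_frames = 2 * half_window + 1
    W = np.zeros((n_frames, n_ctx, d))
    for c, off in enumerate(ctx_offsets):
        W[half_window + off, c, :] = 1.0
    return W

def context_expand(Z, W):
    """Project spliced first-layer outputs down to d dimensions.

    Z: (T, n_ctx, d) first-layer output M_1 h_t, viewed as n_ctx blocks of size d
    W: (n_frames, n_ctx, d) per-frame, per-context, per-dimension weights
    """
    T = Z.shape[0]
    n_frames = W.shape[0]
    half = n_frames // 2
    # Repeat the edge frames at utterance boundaries (an assumption).
    Zp = np.pad(Z, ((half, half), (0, 0), (0, 0)), mode="edge")
    Y = np.zeros((T, Z.shape[2]))
    for f in range(n_frames):
        # Output dimension i only combines component i of each context block.
        Y += np.sum(W[f] * Zp[f:f + T], axis=1)
    return Y

rng = np.random.default_rng(0)
T, n_ctx, d, half_window = 6, 3, 2, 2                  # toy sizes (hypothetical)
Z = rng.normal(size=(T, n_ctx, d))
W = init_weights(n_ctx, d, half_window, ctx_offsets=[0, 1, -1])
Y = context_expand(Z, W)
```

At initialization each output frame is just the sum of the selected context blocks from the chosen neighbouring frames; training then adjusts all weights in the window.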
Context expansion in multiple layer framework (2 of 2)
• I neglected to mention in the paper that the context expansion layer is trained with held-out data (one out of every 10 files).
• Otherwise, it tries to scale up the fMPE contribution as much as it can, which maximizes overtraining.
• This is a problem with all setups that involve multiplying two fMPE-trained things.
• [Note: I do not bother making sure that the source of the “indirect” contributions to the differential was also held out.]
Improved method of setting learning rates
• When changing setups, the appropriate learning rate (controlled by E) can change.
• Set a “target” criterion improvement for the first iteration and set E on the first iteration based on that.
• Use the same value of E in subsequent iterations.
• Using 0.06 for the main (first) matrix and 0.007 for the second (context expansion) layer.
• Reduce these values for low-WER domains.
• Note that the context expansion layer is trained only from the second iteration, since its differential would be zero on the first iteration.
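One way E could be derived from a target improvement is sketched below. It assumes the learning-rate form ν_ij = σ_i / (E (p_ij + n_ij)) and a first-order (linear) prediction of the criterion change, ΔF ≈ Σ_ij ν_ij (∂F/∂M_ij)²; both are assumptions of this sketch, as are the function names and the random statistics. The value 0.06 is used only as an illustrative target.

```python
import numpy as np

def predicted_improvement(p, n, sigma, E, eps=1e-20):
    """First-order predicted criterion change for the step nu_ij * dF/dM_ij.

    p, n:  (d, D) accumulated positive / negative contributions to dF/dM_ij
    sigma: (d,)   per-dimension scales
    """
    grad = p - n
    return np.sum(sigma[:, None] * grad ** 2 / (p + n + eps)) / E

def choose_E(p, n, sigma, target):
    """Pick E so that the predicted improvement equals the target.

    The prediction is proportional to 1/E, so E follows by simple scaling.
    """
    return predicted_improvement(p, n, sigma, E=1.0) / target

rng = np.random.default_rng(0)
d, D = 3, 6                                     # toy sizes (hypothetical)
p = rng.random(size=(d, D))
n = rng.random(size=(d, D))
sigma = np.array([1.0, 2.0, 0.5])
E = choose_E(p, n, sigma, target=0.06)
check = predicted_improvement(p, n, sigma, E)
```

E would be computed this way on the first iteration only and then reused, as the bullets above describe.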
Improved method of setting per-dimension learning rates
• The original per-dimension learning rates included a factor σ_i (an average standard deviation) so that matrix element M_ij has the appropriate scale for the target dimension being added to.
• This did not seem to work well for MFCC parameters: got wide variation in the contribution to criterion improvement between dimensions (perhaps broken by extreme values).
• Replace σ_i with 1/sqrt(S_i), where S_i is the average squared value of the summed positive and negative contributions to each ∂F/∂M_ij.
• Gives a better ratio between learning rates for different dimensions.
• If E is set automatically as described above, the overall learning rate will be appropriate.
“Smooth update”
• When training context expansion, an instability sometimes appeared for certain dimensions.
• Developed a method to detect and stop instabilities.
• Intuition – if too many parameters are changing direction and moving farther than last time, the learning rate is too high.
[Figure: diagram labeled “Too far”]
(1) Define a set of meaningful subsets of the matrix parameters (e.g. matrix rows, columns).
(2) For each subset in decreasing order of size: if, for more than 10% of the parameters p in the subset, the value p_n on iteration n is on the opposite side of p_{n-2} from p_{n-1}, reduce the learning rate for that subset until this no longer holds (i.e. move the parameters p_n towards p_{n-1}).
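Steps (1) and (2) can be sketched as follows. The halving factor, the cap on the number of halvings, and the representation of subsets as NumPy index expressions are assumptions of this sketch; the 10% threshold and the overshoot test come from the description above.

```python
import numpy as np

def overshoot_fraction(p_nm2, p_nm1, p_n):
    """Fraction of parameters whose new value p_n lies on the opposite side of
    p_{n-2} from p_{n-1} (reversed direction and moved farther than last time)."""
    return np.mean((p_n - p_nm2) * (p_nm1 - p_nm2) < 0)

def smooth_update(p_nm2, p_nm1, p_n, subsets, max_frac=0.10,
                  shrink=0.5, max_halvings=30):
    """Pull p_n back towards p_{n-1} on each subset until at most max_frac
    of that subset overshoots. Subsets are visited largest first."""
    p_n = p_n.copy()
    for idx in sorted(subsets, key=lambda s: -p_n[s].size):
        for _ in range(max_halvings):
            if overshoot_fraction(p_nm2[idx], p_nm1[idx], p_n[idx]) <= max_frac:
                break
            # Equivalent to reducing the learning rate for this subset.
            p_n[idx] = p_nm1[idx] + shrink * (p_n[idx] - p_nm1[idx])
    return p_n

# Toy example: two rows of a 2 x 4 matrix over three iterations.
p_nm2 = np.zeros((2, 4))
p_nm1 = np.ones((2, 4))
p_new = np.array([[-1.0, -1.0, -1.0, -1.0],     # row 0 reverses past p_{n-2}
                  [2.0, 2.0, 2.0, 2.0]])        # row 1 keeps going the same way
rows = [(i, slice(None)) for i in range(2)]
cols = [(slice(None), j) for j in range(4)]
p_smoothed = smooth_update(p_nm2, p_nm1, p_new, rows + cols)
```

In the toy example, row 0 is pulled back until it no longer passes p_{n-2}, while row 1 is left untouched.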
“Out of the box” training
• The reason for many of the changes described above is to obtain a setup that will work on different domains without tuning.
• E.g. the new methods of setting learning rates.
• And “smooth update”, which can neutralize the effect of a learning rate that has been set too high.
• fMPE reliably gives improvements without tuning.
  - E.g. recently trained some acoustic models for fast transcription of call center data (no adaptation). fMPE+MPE improved results by 8.5% absolute from a 45% baseline.
  - For a small-vocabulary task, fMPE+MPE improved results by 30% relative from a 1.20% baseline.
• Note – I now always use the same acoustic scales as normal MPE (e.g. 0.1 or 0.05, or the inverse of the normal LM scale if preferred).
Diagnostics
Always use plenty of diagnostics. E.g.
• Per-dimension measures of predicted criterion improvement and sign
• The overall predicted and observed criterion improvement;
• Check for indirect & direct differentials canceling overall (see paper);
• Look at the average size of fMPE contributions to features;
• Check distribution of data among Gaussians used to calculate
• Use measures of difference between HMM sets.
• Print out plenty of graphs and histograms where appropriate.
“It doesn’t work” is not enough information to fix it if it’s broken.
Other issues investigated
• Sigmoid layers – no improvement.
• Momentum update rule – no improvement.
• Training “variances” on features (a quantity added to the (x-μ)² quantities in training and test) – this gave some improvement, ~12% relative.
• Training multiple systems on the same data in parallel, sharing only the fMPE transform – should multiply the effective amount of data (seems to help ~1-2% relative).
• Note – I don’t know whether the way the Gaussians are obtained is critical. Jasha Droppo (Microsoft) suggests training a GMM on the features with a globally tied variance.
What is most important?
• Use an appropriate learning rate for your features (e.g. set a target criterion improvement).
  - Setting the learning rate too high can cause dramatic instability.
  - Setting it too low can cause very slow convergence.
• Use the indirect differential if you want to train on fMPE features.
• Use frame splicing for acoustic context.
• Need a baseline discriminative training setup that works (e.g. lattice