Training Procedure
CMU
Required
1. A collection of training recordings
   • A set of speech recordings
   • The transcriptions for all recordings
2. A "dictionary" with pronunciation entries for all words in the transcriptions
Step 1. Compute Features
1. Compute features (cepstra) for all training recordings
2. Program to use: Wave2feat
Wave2feat is a sphinx feature computation tool:

  ./SphinxTrain-1.0/bin.x86_64-unknown-linux-gnu/wave2feat

[Switch]       [Default]       [Description]
-help          no              Shows the usage of the tool
-example       no              Shows example of how to use the tool
-i                             Single audio input file
-o                             Single cepstral output file
-c                             Control file for batch processing
-nskip                         If a control file was specified, the number of utterances to skip at the head of the file
-runlen                        If a control file was specified, the number of utterances to process (see -nskip too)
-di                            Input directory, input file names are relative to this, if defined
-ei                            Input extension to be applied to all input files
-do                            Output directory, output files are relative to this
-eo                            Output extension to be applied to all output files
-nist          no              Defines input format as NIST sphere
-raw           no              Defines input format as raw binary data
-mswav         no              Defines input format as Microsoft Wav (RIFF)
-input_endian  little          Endianness of input data, big or little, ignored if NIST or MS Wav
-nchans        1               Number of channels of data (interlaced samples assumed)
-whichchan     1               Channel to process
-logspec       no              Write out logspectral files instead of cepstra
-feat          sphinx          SPHINX format - big endian
-mach_endian   little          Endianness of machine, big or little
-alpha         0.97            Preemphasis parameter
-srate         16000.0         Sampling rate
-frate         100             Frame rate
-wlen          0.025625        Hamming window length
-nfft          512             Size of FFT
-nfilt         40              Number of filter banks
-lowerf        133.33334       Lower edge of filters
-upperf        6855.4976       Upper edge of filters
-ncep          13              Number of cep coefficients
-doublebw      no              Use double bandwidth filters (same center freq)
-warp_type     inverse_linear  Warping function type (or shape)
-warp_params                   Parameters defining the warping function
-blocksize     200000          Block size, used to limit the number of samples used at a time when reading very large audio files
-dither        yes             Add 1/2-bit noise to avoid zero energy frames
-seed          -1              Seed for random number generator; if less than zero, pick our own
-verbose       no              Show input filenames
Wave2feat Arguments: Explanation
• -help: Typing wave2feat -help lists the entire set of arguments and their default values
Wave2feat Arguments: Explanation
• -example
• Ignore this flag
Wave2feat Arguments: Explanation
• -i
• -o
• If we want to compute cepstra for a single speech recording, we use these flags.
• The command is
  wave2feat [additional args] -i input_speech_file -o output_cepstrum_file
• Here wave2feat is called with any additional arguments (described in the next few slides)
• The audio file to compute the cepstrum for is specified by the -i flag
• The name of the cepstrum file to write out is specified by the -o flag
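• For instance, to featurize a single MS Wav recording (a minimal sketch; the file names are hypothetical, and the -mswav format flag is described in a later slide):
  wave2feat -mswav yes -i sa1.wav -o sa1.mfc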
Wave2feat Arguments: Explanation
• -c   Control file for batch processing
• -di  Input directory, input file names are relative to this, if defined
• -ei  Input extension to be applied to all input files
• -do  Output directory, output files are relative to this
• -eo  Output extension to be applied to all output files
• Instead of processing data one file at a time, we may want to simply pass a list of files to the program
• In this case we pass the program a file with a list of audio files.
• Let LISTFILE be a file with a list of audio files
  – Note this is not an audio file; it's a text file with a list of audio file names
Wave2feat Arguments: -c -di -ei -do -eo
• "-c LISTFILE"
• Let LISTFILE be a file with a list of audio files
  – Note this is not an audio file; it's a text file with a list of audio file names
• LISTFILE has the following format:
  DIR/FILE
  DIR/FILE
  DIR/FILE
  …
• Each entry in this file has the complete relative path to a file (relative to a root)
• Each entry includes the name of the file, but not the extension
• E.g. a file at $ROOT/RM/ind_trn/adg0_4/sa1.wav is listed as "RM/ind_trn/adg0_4/sa1"
• Note – the extension ".wav" is not part of the entry
Wave2feat Arguments: -c -di -ei -do -eo
• "-c LISTFILE -di ROOT -ei EXT"
• ROOT is the base directory to which the entries in LISTFILE must be appended
• EXT is the filename extension to append to get the actual names of the audio files.
• E.g. we have two audio files at
  /usr/home/mydata/RM/ind_trn/adg0_4/sa1.wav
  /usr/home/mydata/RM/ind_trn/aem0_3/sa2.wav
• We make a file called LISTFILE with the following entries
  ind_trn/adg0_4/sa1
  ind_trn/aem0_3/sa2
• The ROOT directory is /usr/home/mydata/RM
• Filename extensions are wav
• Wave2feat would be called as
  wave2feat [additional args] -c LISTFILE -di /usr/home/mydata/RM -ei wav
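• One way to build LISTFILE automatically (a hedged sketch; assumes the directory layout above and standard find/sed):
  cd /usr/home/mydata/RM
  find ind_trn -name '*.wav' | sed 's/\.wav$//' > LISTFILE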
Wave2feat Arguments: -c -di -ei -do -eo
• "-c LISTFILE -di ROOT -ei EXT -do OUTROOT -eo OUTEXT"
• OUTROOT is the base directory into which the output cepstrum files must be stored
• OUTEXT is the filename extension that must be employed
• E.g. the following command
  wave2feat [additional args] -c LISTFILE -di /usr/home/mydata/RM -ei wav -do /usr/home/mydata/cep/RM -eo mfc
  will read /usr/home/mydata/RM/ind_trn/adg0_4/sa1.wav and store the cepstrum computed at /usr/home/mydata/cep/RM/ind_trn/adg0_4/sa1.mfc
Wave2feat Arguments: Explanation
• -nist   no  Defines input format as NIST sphere
• -raw    no  Defines input format as raw binary data
• -mswav  no  Defines input format as Microsoft Wav
• These arguments specify the format of the input speech files.
• Exactly one of them should be set to "yes"
  – Setting none, or more than one, to "yes" would not be good
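• For example (file names hypothetical): a NIST sphere file would be processed as
  wave2feat -nist yes -i sa1.sph -o sa1.mfc
  and headerless raw audio, which carries no sampling rate or endianness information of its own, as
  wave2feat -raw yes -srate 16000 -input_endian little -i sa1.raw -o sa1.mfc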
Wave2feat Arguments: Explanation
• -input_endian  little  Endianness of input data, big or little, ignored if NIST or MS Wav
• -mach_endian   little  Endianness of machine, big or little
• These arguments specify the endianness of the speech file and the native endianness of the machine respectively
  – If the database was recently collected, odds are that both of these are "little"
  – If the data are in MS Wav or NIST format, it is not necessary to set these variables anyway
Wave2feat Arguments: Explanation
• -nchans     1  Number of channels of data (interlaced samples assumed)
• -whichchan  1  Channel to process
• The audio files are often multi-channel (stereo, 4- or 8-channel recordings). "-nchans" informs the tool of the number of channels in the recording
• "-whichchan" specifies which of the channels to compute features for
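• For instance, to compute features for the second channel of a stereo recording (a sketch; file names hypothetical):
  wave2feat -mswav yes -nchans 2 -whichchan 2 -i stereo.wav -o stereo_ch2.mfc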
Wave2feat Arguments: Explanation
-logspec     no              Write out logspectral files instead of cepstra
-feat        sphinx          SPHINX format - big endian
-alpha       0.97            Preemphasis parameter
-srate       16000.0         Sampling rate
-frate       100             Frame rate
-wlen        0.025625        Hamming window length
-nfft        512             Size of FFT
-nfilt       40              Number of filter banks
-lowerf      133.33334       Lower edge of filters
-upperf      6855.4976       Upper edge of filters
-ncep        13              Number of cep coefficients
-doublebw    no              Use double bandwidth filters (same center freq)
-warp_type   inverse_linear  Warping function type (or shape)
-warp_params                 Parameters defining the warping function
-blocksize   200000          Block size, used to limit the number of samples used at a time when reading very large audio files
-dither      yes             Add 1/2-bit noise to avoid zero energy frames
-seed        -1              Seed for random number generator; if less than zero, pick our own
-verbose     no              Show input filenames
Wave2feat Arguments: Explanation
• -logspec  no  Write out logspectral files instead of cepstra
  – Whether to simply write out mel log spectral features instead of cepstra. This basically skips the final DCT step used to compute cepstra
• -feat  sphinx  SPHINX format - big endian
  – An arcane flag that specifies the format of the output cepstrum (or logspectrum) file. Ignore this flag
• -alpha  0.97  Preemphasis parameter
  – This is the pre-emphasis factor to apply for feature computation. Leave it at its default value
• -srate  16000.0  Sampling rate
  – The sampling rate at which the speech has been sampled
• The remaining flags (-frate through -verbose) are as listed in the table above
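• Note that if your recordings are not sampled at 16 kHz, several of these defaults must change together; in particular -upperf must stay below the Nyquist frequency (half the sampling rate). A hedged sketch for 8 kHz telephone speech (the filter-bank values here are illustrative, not prescribed by this setup):
  wave2feat -mswav yes -srate 8000 -nfilt 31 -lowerf 200 -upperf 3500 -c LISTFILE -di ROOT -ei wav -do OUTROOT -eo mfc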
Required
1. A collection of training recordings
   • A set of speech recordings
   • The transcriptions for all recordings
2. A "dictionary" with pronunciation entries for all words in the transcriptions
3. A "filler" dictionary with pronunciations for noise words and silences
4. A "phoneme list"
Format of list of training data
• A list of training recordings is required in the following format:
  DIR/FILE
  DIR/FILE
  DIR/FILE
  …
• Each entry in this file has the complete relative path to a file (relative to a root)
• Each entry includes the name of the file, but not the extension
• E.g. a file at $ROOT/RM/ind_trn/adg0_4/sa1.wav is listed as "RM/ind_trn/adg0_4/sa1"
• Note – the extension ".wav" is not part of the entry
Alternate format of list of training data
• Sometimes the recordings are long files. In this case we use the following format:
  DIR/FILE STARTING_FRAME ENDING_FRAME TAG
  DIR/FILE STARTING_FRAME ENDING_FRAME TAG
  …
• DIR/FILE is the (full) relative path to the cepstrum file
• STARTING_FRAME is the frame number at which an utterance begins
• ENDING_FRAME is the frame number at which it ends
• If we have computed features at 100 frames / second, and a person speaks from time 10.324 seconds till 12.333, then STARTING_FRAME is 1032 and ENDING_FRAME is 1233
• TAG is a unique tag for each line in the list
• JARGON: WE WILL CALL THIS LIST FILE A "CONTROL FILE"
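• A control file in this format might look like this (the paths, frame numbers, and tags are illustrative):
  RM/ind_trn/adg0_4/session1 1032 1233 adg0_4_utt01
  RM/ind_trn/adg0_4/session1 1450 1720 adg0_4_utt02
  RM/ind_trn/aem0_3/session1 88 305 aem0_3_utt01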
The Transcription File
• The transcription file has a number of text
strings
• There are exactly as many text strings as
the number of entries in the control file
– Regardless of the format of the control file
(one file per line or using time stamps)
• Each line of text represents what was
spoken in the recording represented by
the corresponding line in the control file
The Transcription File
• Each line in the transcription file has the
following format
• BETTY BOTTER BOUGHT SOME BUTTER (TAG)
• The words represent the sequence of words
spoken
• The “TAG” is either the name of the file
– If the control file was in one-file-per-line format
• Or the actual TAG used in the corresponding line
of the control file
The Transcription File
• The transcriptions in the transcription file should
ideally also identify any extraneous noises and
disfluencies
– E.g.
BUT ++UM++ THE BU- BU- BUTTER WAS ++UH++
++NOISE_IN_BACKGROUND++ ++SCREAM++
++SLAPPING_NOISE++ BITTER
• The terms in ++*++ are the extraneous sounds
• The BU- BU- represent exactly what was spoken – the
speaker stuttered
The Dictionary File
• The Dictionary file contains pronunciations of all the words in the transcriptions
  BETTY      B EH T IY
  BOTTER     B AA T ER
  BOUGHT     B AO T
  SOME       S AH M
  BUT        B AH T
  BU-        B AH
  BUTTER     B AH D AXR
  BUTTER(2)  B AH T ER
• Each line represents the pronunciation of one of the words
• Entries with (2), (3), etc. represent alternate pronunciations of the word
  – E.g. there are two ways of representing BUTTER
• Note: The partial word "BU-" is also listed with its pronunciation
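• Before training, it is worth verifying that every transcript word has a dictionary entry. A hedged sketch (assumes each transcript line ends with its "(TAG)" field and fillers are marked ++...++):
  # words used in the transcripts (skip the trailing tag and ++filler++ terms)
  awk '{for(i=1;i<NF;i++) if ($i !~ /^\+\+/) print $i}' $transcriptfile | sort -u > words.txt
  # words defined in the dictionary (strip alternate-pronunciation markers like "(2)")
  awk '{print $1}' $dictionary | sed 's/([0-9]*)$//' | sort -u > dictwords.txt
  # anything printed here is missing from the dictionary
  comm -23 words.txt dictwords.txt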
The Filler Dictionary
• We also need a separate filler dictionary
• This lists all the non-speech sounds encountered in the transcripts and would look like this:
  <s>                      SILENCE
  </s>                     SILENCE
  <sil>                    SILENCE
  ++UM++                   +UM+
  ++UH++                   +UH+
  ++NOISE_IN_BACKGROUND++  +NOISE+
  ++SCREAM++               +NOISE2+
  ++SLAPPING_NOISE++       +NOISE+
• The first three entries must always be part of the filler dictionary
• The remaining terms are for non-speech sounds in the data
• The terms to the right are "noise" phonemes
• Multiple sounds may be mapped onto the same noise phoneme
The phoneme list
• The phoneme list is a file that contains all
the phonemes used in the dictionary and
filler dictionary
– Including silence and noise phonemes
• This can be created by simply running:
  cat $dictionary $fillerdict | awk '{for(i=2;i<=NF;i++) print $i}' | sort -u >! $PHONELIST
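• With the example dictionary and filler dictionary above, the resulting phoneme list would be (in C-locale sort order):
  +NOISE+
  +NOISE2+
  +UH+
  +UM+
  AA
  AH
  AO
  AXR
  B
  D
  EH
  ER
  IY
  M
  S
  SILENCE
  T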
Training: Preliminaries
• In the first pass of training, we add silences at
the beginning and end of every utterance in
the transcriptions
• So we modify entries such as the following
BETTY BOTTER BOUGHT SOME BUTTER (buying01)
• To
<s> BETTY BOTTER BOUGHT SOME BUTTER </s> (buying01)
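• This can be done with a one-liner such as the following (a sketch; assumes every transcript line ends with its "(TAG)" field):
  awk '{ $NF = "</s> " $NF; print "<s>", $0 }' $transcriptfile > transcripts.with_sil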
Preliminaries: Setting Parameters
• The following are specific to Rita’s setup
– The sphinxtrain setup may be slightly different
• Modify “variables.def” to point to all the right
things:
set workdir = BLAH
– This is the directory where all the training is being
performed
set base_dir = BLAH
– This is the root directory where the output of the
training including log files, buffers, models etc. will be
created
Preliminaries: Setting Parameters
set bindir = BLAH
– This is the directory in which all the recognizer
executables live
set listoffiles = BLAH
– This is the complete path to the control file with the list of training files
set transcriptfile = BLAH
– This is the complete path to the file with transcriptions
Preliminaries: Setting Parameters
set dictionary = BLAH
– This is the complete path to the dictionary file
set fillerdict = BLAH
– This is the complete path to the filler dictionary
set phonefile = BLAH
– This is the complete path to the file with the list of
phonemes
set featfiles_dir = BLAH
– This is the directory where the feature files reside.
This would be the same directory that was given to wave2feat with the -do flag
Preliminaries: Setting Parameters
set featfile_extension = BLAH
– This is the file extension to apply to the file names in the listfile to get the complete name of the feature file. This would be the same argument specified to wave2feat with the -eo flag
Preliminaries: Setting Parameters
set vector_length = BLAH
– This specifies the dimensionality of our cepstral vector. A typical value is 13
Preliminaries: Setting HMM Parameters
set statesperhmm = BLAH
– How many states in the HMM. Typical values are 3 or 5
set skipstate = BLAH
– Answer should be “yes” or “no”. This specifies
whether the topology of the HMMs should allow
transitions from state N to state N+2 directly or not.
For 3-state HMMs, this should be “no” and for 5-state
HMMs it should be “yes”
set gaussiansperstate = BLAH
– How many Gaussians are wanted in the Gaussian
mixture state output distributions. If we have more
training data, we’d want this number to be larger
Preliminaries: Setting HMM Parameters
set agc = none
– “Automatic Gain Control”. Just set to “none” and do
not mess with this
set cmn = current
– How to perform cepstral mean normalization. Set to
“current” and do not mess with this
set varnorm = BLAH
– Do we want to perform variance normalization on our
features? Answer “yes” or “no” only.
If the answer is “yes” variance normalization must be
performed during recognition also
Preliminaries: Setting HMM Parameters
set feature = 1s_c_d_dd
– The type of "extended" feature to train models with. 1s_c_d_dd uses cepstra (c), delta cepstra (d) and double-delta cepstra (dd). Other features are also available in Sphinx. The feature used for decoding must be identical to the one used to train the model.
set type = .cont.
– We can perform different kinds of parameter sharing. When the type is ".cont.", we let each tied state be independent. Another variant is ".semi.", which you don't normally want to use
set n_tied_states = BLAH
– How many tied states to learn in total, across all triphones of all
phonemes
Preliminaries: Setting HMM Parameters
set npart = ?
– We can parallelize the training into many parts. More is better, but only if we have more processors.
set n_part_untied = ?
– There are 3 stages of training – CI, untied and tied. CI and tied are less bandwidth intensive than untied. So the untied portion is better run in fewer parts than the other two stages of training. This variable specifies how many parts to run untied training in
set convergence_ratio = BLAH
– The training continues at each stage until the likelihood of the training data converges. This value (e.g. 0.00001) specifies the minimum relative increase in likelihood; once the improvement falls below it, we stop iterating
set maxiter = BLAH
– We also specify a maximum number of iterations to perform,
regardless of likelihood convergence
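Pulling these together, the HMM-related portion of variables.def for a modest 16 kHz corpus might look like this (every value below is illustrative, not prescriptive):
  set statesperhmm = 3
  set skipstate = no
  set gaussiansperstate = 8
  set agc = none
  set cmn = current
  set varnorm = no
  set feature = 1s_c_d_dd
  set type = .cont.
  set n_tied_states = 1000
  set npart = 4
  set n_part_untied = 2
  set convergence_ratio = 0.00001
  set maxiter = 15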
Training: Stage 1 – context independent training
• Train context independent models.
  – On Rita's Setup
    > cd 01.ci_chmm
    > ./slave_convg.csh
• slave_convg will train context-independent models
• Process:
  – slave_convg launches a number of "baum_welch" jobs and a norm_and_launch_bw job
  – The norm_and_launch_bw job only runs when all baum_welch jobs have run to completion
• baum_welch computes a set of "buffers" from which the next iteration of models is computed
• norm_and_launch_bw gathers these buffers, computes a new set of models, and calls slave_convg to launch jobs for the next iteration
• If the training has converged, norm_and_launch_bw simply invokes stage 2 of training, instead of launching further iterations
Training: Stage 2 – “untied” context
dependent training
• >cd 02.cd_chmm
• >./slave_convg.csh
• Slave_convg will train context-dependent models without parameter
sharing of any kind
• Process: as before slave_convg launches baum_welch jobs and a
norm_and_launch_bw job. The latter aggregates baum_welch output and
calls slave_convg for the next iteration
• The output of this stage of training is a set of models for context-dependent phonemes (triphones)
• These models do not share state output distributions
– Every triphone has its own unique set of state output distributions
– These distributions are Gaussians
• Upon convergence norm_and_launch_bw calls the next stage
Training: Stage 3 – Decision Trees
• > cd 03.buildtrees
• > ./slave_treebuilder.csh
• The “untied” context-dependent models are used to train
decision trees for state tying
• The “slave” script launches one job for each state of each
phoneme
– A separate decision tree is trained for each state of each phoneme
• > ./prune_tree_and_tie_state.csh
– This prunes all the trees and builds a lookup table (mdef) that stores
the index of the tied state for each state of each triphone
– It then launches stage 4 of training
Training: Stage 4 – Tied-State Models
• > cd 04.cd_chmm
• > ./slave_convg.csh
• The slave launches baum_welch jobs and a
norm_and_launch_bw job
– The norm_and_launch_bw aggregates baum-welch outputs,
reestimates models, and if convergence has not happened, calls
slave_convg
– Upon convergence, if the desired no. of Gaussians has not been
obtained, norm_and_launch_bw first calls a job that splits Gaussians
before relaunching the next set of baum-welch jobs
– If convergence has happened and the desired number of Gaussians
have been obtained, norm_and_launch_bw exits