Training Procedure
CMU

Required
1. A collection of training recordings
   • A set of speech recordings
   • The transcriptions for all recordings
2. A “dictionary” with pronunciation entries for all words in the transcriptions

Step 1. Compute Features
1. Compute features (cepstra) for all training recordings
2. Program to use: Wave2feat

Wave2feat is a Sphinx feature computation tool:

./SphinxTrain-1.0/bin.x86_64-unknown-linux-gnu/wave2feat
[Switch]       [Default]       [Description]
-help          no              Shows the usage of the tool
-example       no              Shows example of how to use the tool
-i                             Single audio input file
-o                             Single cepstral output file
-c                             Control file for batch processing
-nskip                         If a control file was specified, the number of utterances to skip at the head of the file
-runlen                        If a control file was specified, the number of utterances to process (see -nskip too)
-di                            Input directory, input file names are relative to this, if defined
-ei                            Input extension to be applied to all input files
-do                            Output directory, output files are relative to this
-eo                            Output extension to be applied to all output files
-nist          no              Defines input format as NIST sphere
-raw           no              Defines input format as raw binary data
-mswav         no              Defines input format as Microsoft Wav (RIFF)
-input_endian  little          Endianness of input data, big or little, ignored if NIST or MS Wav
-nchans        1               Number of channels of data (interlaced samples assumed)
-whichchan     1               Channel to process
-logspec       no              Write out logspectral files instead of cepstra
-feat          sphinx          SPHINX format - big endian
-mach_endian   little          Endianness of machine, big or little
-alpha         0.97            Preemphasis parameter
-srate         16000.0         Sampling rate
-frate         100             Frame rate
-wlen          0.025625        Hamming window length
-nfft          512             Size of FFT
-nfilt         40              Number of filter banks
-lowerf        133.33334       Lower edge of filters
-upperf        6855.4976       Upper edge of filters
-ncep          13              Number of cep coefficients
-doublebw      no              Use double bandwidth filters (same center freq)
-warp_type     inverse_linear  Warping function type (or shape)
-warp_params                   Parameters defining the warping function
-blocksize     200000          Block size, used to limit the number of samples used at a time when reading very large audio files
-dither        yes             Add 1/2-bit noise to avoid zero energy frames
-seed          -1              Seed for random number generator; if less than zero, pick our own
-verbose       no              Show input filenames

Wave2feat Arguments: Explanation
• -help
• Just typing wave2feat -help lists the entire set of arguments and their default values

Wave2feat Arguments: Explanation
• -example
• Ignore this flag

Wave2feat Arguments: Explanation
• -i
• -o
• If we want to compute cepstra for a single speech recording, we use these flags.
• The command is
• wave2feat [additional args] -i input_speech_file -o output_cepstrum_file
• Here wave2feat is called with any additional arguments (described in the next few slides)
• The audio file to compute the cepstrum for is specified by the -i flag
• The name of the cepstrum file to write out is specified by the -o flag

Wave2feat Arguments: Explanation
• -c Control file for batch processing
• -di Input directory, input file names are relative to this, if defined
• -ei Input extension to be applied to all input files
• -do Output directory, output files are relative to this
• -eo Output extension to be applied to all output files
• Instead of processing data one file at a time, we may want to simply pass a list of files to the program
• In this case we pass the program a file with a list of audio files.
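Such a list file can be generated rather than typed by hand. The following is a minimal sketch, assuming the list holds one entry per audio file, written as a path relative to the root directory with the extension removed. The function name build_listfile and its parameters are hypothetical; they are not part of wave2feat or SphinxTrain.

```python
from pathlib import Path


def build_listfile(root, listfile, ext="wav"):
    """Write one extension-less, root-relative path per audio file.

    Hypothetical helper: it only prepares the text file that would be
    handed to wave2feat with -c; it does not call wave2feat itself.
    """
    root = Path(root)
    # Collect every file with the given extension under root, drop the
    # extension, and express the path relative to root with "/" separators.
    entries = sorted(
        p.relative_to(root).with_suffix("").as_posix()
        for p in root.rglob(f"*.{ext}")
    )
    # One entry per line, as a plain text file.
    Path(listfile).write_text("".join(entry + "\n" for entry in entries))
    return entries
```

The resulting file would then be passed as wave2feat [additional args] -c LISTFILE -di ROOT -ei wav, with ROOT being the same directory given to build_listfile.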
Wave2feat Arguments: -c -di -ei -do -eo
• “-c LISTFILE”
• Let LISTFILE be a file with a list of audio files
  – Note this is not an audio file; it’s a text file with a list of audio file names
• LISTFILE has the following format:
  DIR/FILE
  DIR/FILE
  DIR/FILE
  …
• Each entry in this file has the complete relative path to a file (relative to a root)
• Each entry includes the name of the file, but not the extension
• E.g. a file at $ROOT/RM/ind_trn/adg0_4/sa1.wav is listed as “RM/ind_trn/adg0_4/sa1”
• Note – the extension “.wav” is not part of the entry

Wave2feat Arguments: -c -di -ei -do -eo
• “-c LISTFILE -di ROOT -ei EXT”
• ROOT is the base directory to which the entries in LISTFILE must be appended
• EXT is the filename extension to append to get the actual names of the audio files.
• E.g. we have two audio files at
  /usr/home/mydata/RM/ind_trn/adg0_4/sa1.wav
  /usr/home/mydata/RM/ind_trn/aem0_3/sa2.wav
• We make a file called LISTFILE with the following entries
  ind_trn/adg0_4/sa1
  ind_trn/aem0_3/sa2
• The ROOT directory is /usr/home/mydata/RM
• Filename extensions are wav
• Wave2feat would be called as
• wave2feat [additional args] -c LISTFILE -di /usr/home/mydata/RM -ei wav

Wave2feat Arguments: -c -di -ei -do -eo
• “-c LISTFILE -di ROOT -ei EXT -do OUTROOT -eo OUTEXT”
• OUTROOT is the base directory into which the output cepstrum files must be stored
• OUTEXT is the filename extension that must be employed
• E.g. the following command
  wave2feat [additional args] -c LISTFILE -di /usr/home/mydata/RM -ei wav -do /usr/home/mydata/cep/RM -eo mfc
• Will read /usr/home/mydata/RM/ind_trn/adg0_4/sa1.wav and store the cepstrum computed at
• /usr/home/mydata/cep/RM/ind_trn/adg0_4/sa1.mfc

Wave2feat Arguments: Explanation
• -nist no Defines input format as NIST sphere
• -raw no Defines input format as raw binary data
• -mswav no Defines input format as Microsoft Wav
• These arguments specify the format of the input speech files.
• Exactly one of them should be set to “yes”
  – Setting none or more than one to “yes” would not be good

Wave2feat Arguments: Explanation
• -input_endian little Endianness of input data, big or little, ignored if NIST or MS Wav
• -mach_endian little Endianness of machine, big or little
• These arguments specify the endianness of the speech file and the native endianness of the machine respectively
  – If the database was recently collected, odds are that both of these are “little”
  – If the data are in mswav or NIST format, it is not necessary to set these variables anyway

Wave2feat Arguments: Explanation
• -nchans 1 Number of channels of data (interlaced samples assumed)
• -whichchan 1 Channel to process
• The audio files are often multi-channel (stereo, 4- or 8-channel recordings).
  “-nchans” informs the machine about the number of channels in the recording
• “-whichchan” specifies which of the channels to compute features for

Wave2feat Arguments: Explanation
• -logspec no Write out logspectral files instead of cepstra
  – Whether to simply write out mel log spectral features instead of cepstra. This basically skips the final DCT step used to compute cepstra
• -feat sphinx SPHINX format - big endian
  – An arcane flag that specifies the format of the output cepstrum (or logspectrum) file. Ignore this flag
• -alpha 0.97 Preemphasis parameter
  – This is the pre-emphasis factor to apply for feature computation. Leave it at its default value
• -srate 16000.0 Sampling rate
  – The sampling rate at which the speech has been sampled

Required
1. A collection of training recordings
   • A set of speech recordings
   • The transcriptions for all recordings
2. A “dictionary” with pronunciation entries for all words in the transcriptions
3. A “filler” dictionary with pronunciations for noise words and silences
4. A “phoneme list”

Format of list of training data
• A list of training recordings is required in the following format:
  DIR/FILE
  DIR/FILE
  DIR/FILE
  …
• Each entry in this file has the complete relative path to a file (relative to a root)
• Each entry includes the name of the file, but not the extension
• E.g. a file at $ROOT/RM/ind_trn/adg0_4/sa1.wav is listed as “RM/ind_trn/adg0_4/sa1”
• Note – the extension “.wav” is not part of the entry

Alternate format of list of training data
• Sometimes the recordings are long files.
  In this case we use the following format:
  DIR/FILE STARTING_FRAME ENDING_FRAME TAG
  DIR/FILE STARTING_FRAME ENDING_FRAME TAG
  …
• DIR/FILE is the (full) relative path to the cepstrum file
• STARTING_FRAME is the frame number at which an utterance begins
• ENDING_FRAME is the frame number at which it ends
• If we have computed features at 100 frames / second, and a person speaks from time 10.324 seconds till 12.333 seconds, then STARTING_FRAME is 1032 and ENDING_FRAME is 1233
• TAG is a unique tag for each line in the list
• JARGON: WE WILL CALL THIS LIST FILE A “CONTROL FILE”

The Transcription File
• The transcription file has a number of text strings
• There are exactly as many text strings as there are entries in the control file
  – Regardless of the format of the control file (one file per line or using time stamps)
• Each line of text represents what was spoken in the recording represented by the corresponding line in the control file

The Transcription File
• Each line in the transcription file has the following format
• BETTY BOTTER BOUGHT SOME BUTTER (TAG)
• The words represent the sequence of words spoken
• The “TAG” is either the name of the file
  – If the control file was in one-file-per-line format
• Or the actual TAG used in the corresponding line of the control file

The Transcription File
• The transcriptions in the transcription file should ideally also identify any extraneous noises and disfluencies
  – E.g. BUT ++UM++ THE BU- BU- BUTTER WAS ++UH++ ++NOISE_IN_BACKGROUND++ ++SCREAM++ ++SLAPPING_NOISE++ BITTER
• The terms in ++*++ are the extraneous sounds
• The BU- BU- represent exactly what was spoken – the speaker stuttered

The Dictionary File
• The dictionary file contains pronunciations of all the words in the transcriptions
  BETTY      B EH T IY
  BOTTER     B AA T ER
  BOUGHT     B AO T
  SOME       S AH M
  BUT        B AH T
  BU-        B AH
  BUTTER     B AH D AXR
  BUTTER(2)  B AH T ER
• Each line represents the pronunciation of one of the words
• Entries with a parenthesized index such as “(2)” represent alternate pronunciations of the word
  – E.g. there are two ways of representing BUTTER
• Note: The partial word “BU-” is also listed with its pronunciation

The Filler Dictionary
• We also need a separate filler dictionary
• This lists all the non-speech sounds encountered in the transcripts and would look like this
  <s>                      SILENCE
  </s>                     SILENCE
  <sil>                    SILENCE
  ++UM++                   +UM+
  ++UH++                   +UH+
  ++NOISE_IN_BACKGROUND++  +NOISE+
  ++SCREAM++               +NOISE2+
  ++SLAPPING_NOISE++       +NOISE+
• The first three entries must always be part of the filler dictionary
• The remaining terms are for non-speech sounds in the data
• The terms to the right are “noise” phonemes
• Multiple sounds may be mapped onto the same noise phoneme

The phoneme list
• The phoneme list is a file that contains all the phonemes used in the dictionary and filler dictionary
  – Including silence and noise phonemes
• This can be created by simply running:
  – cat $dictionary $fillerdict | awk '{for(i=2;i<=NF;i++) print $i}' | sort -u >! $PHONELIST

Training: Preliminaries
• In the first pass of training, we add silences at the beginning and end of every utterance in the transcriptions
• So we modify entries such as the following
  BETTY BOTTER BOUGHT SOME BUTTER (buying01)
• To
  <s> BETTY BOTTER BOUGHT SOME BUTTER </s> (buying01)

Preliminaries: Setting Parameters
• The following are specific to Rita’s setup
  – The sphinxtrain setup may be slightly different
• Modify “variables.def” to point to all the right things:
set workdir = BLAH
– This is the directory where all the training is being performed
set base_dir = BLAH
– This is the root directory where the output of the training, including log files, buffers, models etc., will be created

Preliminaries: Setting Parameters
set bindir = BLAH
– This is the directory in which all the recognizer executables live
set listoffiles = BLAH
– This is the complete path to the control file with the list of training files
set transcriptfile = BLAH
– This is the complete path to the file with transcriptions

Preliminaries: Setting Parameters
set dictionary = BLAH
– This is the complete path to the dictionary file
set fillerdict = BLAH
– This is the complete path to the filler dictionary
set phonefile = BLAH
– This is the complete path to the file with the list of phonemes
set featfiles_dir = BLAH
– This is the directory where the feature files reside. This would be the same directory that was given to wave2feat with the -do flag

Preliminaries: Setting Parameters
set featfile_extension = BLAH
– This is the file extension to apply to the file names in the listfile to get the complete name of the feature file. This would be the same argument specified to wave2feat with the -eo flag
set vector_length = BLAH
– This specifies the dimensionality of our cepstral vector. A typical value is 13

Preliminaries: Setting HMM Parameters
set statesperhmm = BLAH
– How many states in the HMM. Typical values are 3 or 5
set skipstate = BLAH
– Answer should be “yes” or “no”. This specifies whether the topology of the HMMs should allow transitions from state N to state N+2 directly or not. For 3-state HMMs, this should be “no” and for 5-state HMMs it should be “yes”
set gaussiansperstate = BLAH
– How many Gaussians are wanted in the Gaussian mixture state output distributions. If we have more training data, we’d want this number to be larger

Preliminaries: Setting HMM Parameters
set agc = none
– “Automatic Gain Control”.
  Just set to “none” and do not mess with this
set cmn = current
– How to perform cepstral mean normalization. Set to “current” and do not mess with this
set varnorm = BLAH
– Do we want to perform variance normalization on our features? Answer “yes” or “no” only. If the answer is “yes”, variance normalization must be performed during recognition also

Preliminaries: Setting HMM Parameters
set feature = 1s_c_d_dd
– The type of “extended” feature to train models with. 1s_c_d_dd uses cepstra (c), delta cepstra (d) and double-delta cepstra (dd). Other features are also available in Sphinx. The feature used for decoding must be identical to the one used to train the model.
set type = .cont.
– We can perform different kinds of parameter sharing. When the type is “.cont.”, we let each tied state be independent. Another variant is “.semi.”. You don’t normally want to use this
set n_tied_states = BLAH
– How many tied states to learn in total, across all triphones of all phonemes

Preliminaries: Setting HMM Parameters
set npart = ?
– We can parallelize the training into many parts. More is better, but only if we have more processors.
set n_part_untied = ?
– There are 3 stages of training: CI, untied and tied. CI and tied are less bandwidth intensive than untied, so the untied portion is better run in fewer parts than the other two stages of training. This variable specifies how many parts to run untied training in
set convergence_ratio = BLAH
– The training continues at each stage until the likelihood of the training data converges. This value (e.g. 0.00001) specifies the minimum relative increase in likelihood; when the increase falls below it, we stop iterating
set maxiter = BLAH
– We also specify a maximum number of iterations to perform, regardless of likelihood convergence

Training: Stage 1 – context independent training
• Train context independent models.
  – On Rita’s Setup
• > cd 01.ci_chmm
• > ./slave_convg.csh
• Slave_convg will train context-independent models
• Process:
  – Slave_convg launches a number of “baum_welch” jobs and a norm_and_launch_bw job
  – The norm_and_launch_bw job only runs when all baum_welch jobs have run to completion
• Baum_welch computes a set of “buffers” from which the next iteration of models is computed
• Norm_and_launch_bw gathers these buffers, computes a new set of models, and calls slave_convg to launch jobs for the next iteration
• If the training has converged, norm_and_launch_bw simply invokes stage 2 of training, instead of launching further iterations

Training: Stage 2 – “untied” context dependent training
• > cd 02.cd_chmm
• > ./slave_convg.csh
• Slave_convg will train context-dependent models without parameter sharing of any kind
• Process: as before, slave_convg launches baum_welch jobs and a norm_and_launch_bw job. The latter aggregates baum_welch output and calls slave_convg for the next iteration
• The output of this stage of training is a set of models for context-dependent phonemes (triphones)
• These models do not share state output distributions
  – Every triphone has its own unique set of state output distributions
  – These distributions are Gaussians
• Upon convergence norm_and_launch_bw calls the next stage

Training: Stage 3 – Decision Trees
• > cd 03.buildtrees
• > ./slave_treebuilder.csh
• The “untied” context-dependent models are used to train decision trees for state tying
• The “slave” script launches one job for each state of each phoneme
  – A separate decision tree is trained for each state of each phoneme
• > ./prune_tree_and_tie_state.csh
  – This prunes all the trees and builds a lookup table (mdef) that stores the index of the tied state for each state of each triphone
  – It then launches stage 4 of training

Training: Stage 4 – Tied-State Models
• > cd 04.cd_chmm
• > ./slave_convg.csh
• The slave launches baum_welch jobs and a norm_and_launch_bw job
  – The norm_and_launch_bw job aggregates baum_welch outputs, reestimates models, and if convergence has not happened, calls slave_convg
  – Upon convergence, if the desired number of Gaussians has not been obtained, norm_and_launch_bw first calls a job that splits Gaussians before relaunching the next set of baum_welch jobs
  – If convergence has happened and the desired number of Gaussians has been obtained, norm_and_launch_bw exits
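
The control flow of stage 4 can be summarized in a few lines of code. This is a minimal sketch under stated assumptions, not the actual csh scripts: it assumes the relative increase is measured as (new − old) / |old| on the training-data log-likelihood, and that maxiter is applied separately to each convergence pass. The names has_converged, train_tied_states, run_iteration and split_gaussians are hypothetical stand-ins for the baum_welch, normalization and splitting jobs.

```python
def has_converged(prev_ll, cur_ll, convergence_ratio):
    """Return True when the relative likelihood gain drops below the ratio."""
    if prev_ll is None or prev_ll == 0:
        return False  # no usable previous value to compare against
    # Assumption: relative increase = (new - old) / |old|.
    return (cur_ll - prev_ll) / abs(prev_ll) < convergence_ratio


def train_tied_states(run_iteration, split_gaussians, n_gaussians,
                      target_gaussians, convergence_ratio, maxiter):
    """Iterate to convergence, then split Gaussians until the target is met.

    run_iteration(n) stands in for one round of baum_welch jobs plus
    normalization and returns the training-data log-likelihood.
    split_gaussians(n) stands in for the splitting job and must return a
    larger Gaussian count, otherwise the loop never terminates.
    """
    history = []
    while True:
        prev_ll = None
        # Assumption: maxiter bounds each convergence pass separately.
        for _ in range(maxiter):
            cur_ll = run_iteration(n_gaussians)
            history.append((n_gaussians, cur_ll))
            if has_converged(prev_ll, cur_ll, convergence_ratio):
                break
            prev_ll = cur_ll
        if n_gaussians >= target_gaussians:
            return history  # converged with the desired number of Gaussians
        n_gaussians = split_gaussians(n_gaussians)
```

The same convergence check, without the splitting step, describes stages 1 and 2, where the next stage is launched as soon as the likelihood has converged.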