CommunicatorPackFAQ

These are the things I thought of, which might make the dataset more
useful for non-speech people:
-------------------------- MFCC features:
1. A short, general description of the data, including: the number of
speakers (approximately), rough gender distribution, what's the
communicator task, what kind of text was read (in fact if you have the
reference text that would be nice to include too),
Answer: To quote the Communicator web page: "The CMU Communicator project
explores advanced dialog management architectures for complex problem solving tasks.
It is a project under the Carnegie Mellon Sphinx Group and is funded by the DARPA
Communicator program."
Testers of the system were allowed to ask it for flight information using
unconstrained language. The distributed package contains only the test sets.
[Arthur: I currently don't have information about the number of speakers or the gender
distribution. I will try to look them up in the literature.]
2. How long a time window is each 39-dim vector? Do windows overlap?
Answer:
Some basic information about the front-end we used.
Sampling Rate: 8000 Hz
Cepstral Mean Normalization: Yes
Frame Rate: 100 frames/second
Window Length: 0.0256s
Lower Filter: 200Hz
Upper Filter: 3500Hz
Number of filters: 31
Number of MFCCs: 13
So, to answer the question: the window length is 0.0256 s, and the windows overlap because
they advance at 100 frames/second. That means the frame shift is 1/100 = 0.01 s, so
adjacent windows share 0.0156 s.
[The frame shift still needs to be confirmed, though I don't think it is crucial.]
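The relation between frame rate, frame shift and overlap can be checked with a few lines of arithmetic. This is only a sketch based on the settings listed in this answer; the variable names are made up for illustration, and how the real front end rounds the window length to whole samples is not covered here.

```python
# Front-end settings copied from the list in this answer.
SAMPLE_RATE_HZ = 8000
FRAME_RATE_HZ = 100          # frames per second
WINDOW_LENGTH_S = 0.0256

# Time between the starts of two consecutive frames.
frame_shift_s = 1.0 / FRAME_RATE_HZ
# Part of the signal shared by two adjacent windows.
overlap_s = WINDOW_LENGTH_S - frame_shift_s
# Number of samples the window advances per frame.
samples_per_shift = SAMPLE_RATE_HZ // FRAME_RATE_HZ


def frame_start_time(t):
    """Start time in seconds of frame t (0-based)."""
    return t * frame_shift_s


print(frame_shift_s, samples_per_shift)
```

So frame t covers roughly the interval from frame_start_time(t) to frame_start_time(t) + 0.0256 seconds.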
3. What are the 39 features? I find the first vector to be
13 something, 26 zeros
the second vector is
13 something, 13 zeros, 13 something
which makes me wonder whether the order is MFCC, deltadelta, delta?
Answer: This is an artifact caused by the dynamic feature computation. I will explain it
in the answer to Question 5.
4. Within the 13dim MFCC, is the order from low frequency (1st feature)
to high frequency (the 13th feature) or the other way around? Is there
an energy dimension?
Answer: I think it would be useful to briefly describe what the front end does at this
point.
Each frame is processed as follows.
1, A pre-emphasis filter is applied to the samples.
2, A Hamming window is applied to the frame.
3, The FFT of the frame is computed to obtain a spectrum vector.
4, The spectrum vector is passed through a bank of filters. In our case, the lowest filter
is at 200 Hz and the highest one is at 3500 Hz. The number of filters is 31 and each
filter is triangular. The magnitude is obtained for each filter output, so at this point we
have a vector of length 31.
5, This vector is then transformed by a discrete cosine transform, and 13 coefficients are kept.
Because of this last transform, the 13 MFCC dimensions do not correspond to individual
frequency bands, so it is hard to order the features by frequency.
[Arthur: I find it quite difficult to understand DCT-transformed vectors intuitively myself. I can
only say that they are a smoothed version of the spectral coefficients.
This part may need to be expanded in the future. I may want to include formulas at
some point.
]
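The five front-end steps can be sketched in plain Python for a single frame. This is not the Sphinx front end, only an illustration under several assumptions that the text does not state: a pre-emphasis coefficient of 0.97, a 256-point transform, the common 2595*log10(1+f/700) mel scale, 200 Hz and 3500 Hz taken as the outer edges of the filter bank, a logarithm applied to the filter outputs, and an unnormalized DCT-II. The function and parameter names are hypothetical.

```python
import cmath
import math


def mel(f_hz):
    # A common mel-scale formula (an assumption, not taken from Sphinx).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mfcc_frame(samples, rate=8000, n_fft=256, n_filters=31, n_ceps=13,
               lo_hz=200.0, hi_hz=3500.0, preemph=0.97):
    """Return n_ceps cepstral coefficients for one frame (2..n_fft samples)."""
    n = len(samples)
    # 1. Pre-emphasis: y[k] = s[k] - a * s[k-1].
    y = [samples[0]] + [samples[k] - preemph * samples[k - 1]
                        for k in range(1, n)]
    # 2. Hamming window.
    y = [v * (0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)))
         for k, v in enumerate(y)]
    # 3. Magnitude spectrum via a naive DFT on the zero-padded frame.
    y = y + [0.0] * (n_fft - n)
    half = n_fft // 2 + 1
    spec = [abs(sum(y[k] * cmath.exp(-2j * math.pi * b * k / n_fft)
                    for k in range(n_fft))) for b in range(half)]
    # 4. Triangular filters, equally spaced on the mel scale.
    step = (mel(hi_hz) - mel(lo_hz)) / (n_filters + 1)
    edges = [inv_mel(mel(lo_hz) + i * step) for i in range(n_filters + 2)]
    bin_hz = rate / n_fft
    fbank = []
    for j in range(1, n_filters + 1):
        left, centre, right = edges[j - 1], edges[j], edges[j + 1]
        total = 0.0
        for b in range(half):
            f = b * bin_hz
            if left < f <= centre:
                total += spec[b] * (f - left) / (centre - left)
            elif centre < f < right:
                total += spec[b] * (right - f) / (right - centre)
        # Log compression (assumed); the small constant avoids log(0).
        fbank.append(math.log(total + 1e-10))
    # 5. DCT-II of the 31 filter outputs, keeping the first n_ceps terms.
    return [sum(fbank[j] * math.cos(math.pi * i * (j + 0.5) / n_filters)
                for j in range(n_filters)) for i in range(n_ceps)]
```

Each output coefficient is a weighted sum over all 31 filter outputs, which is why a single MFCC dimension cannot be tied to one frequency band.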
5. How do you compute delta and deltadelta exactly (please give formulae
like x[t+1]-x[t-1])?
---------------------------- Gaussians:
-------- the mean
Answer: We are using a dynamic feature type called s3_1x39. The exact computation is
as follows.
The cepstral vector has 13 dimensions. The first 12 are the actual cepstral coefficients
(x[0][0-11]); the last one can be C0 or energy (x[0][12]). The index 0 refers to the current
frame. We use this notation because the computation requires previous and future
frames.
A string like s3_1x39 names one way of arranging the cepstral vector and its
dynamic coefficients in the feature vector. (There are many ways; another is 1s_c_d_dd.)
s3_1x39 organizes the 39-dimensional vector into 4 parts:
12 dimensions of cepstral coefficients (c), 12 dimensions of delta coefficients
(d), 3 dimensions of energy, delta energy and delta delta energy (e, de, dde), and 12 dimensions
of delta delta coefficients (dd).
The first 12 dimensions are the 12 cepstral coefficients of the current frame:
c[i] = x[0][i] for i = 0 to 11
The next 12 dimensions are the delta coefficients, computed as
d[i] = x[2][i] - x[-2][i] for i = 0 to 11
The next 3 dimensions are energy, delta energy and delta delta energy:
e = x[0][12]
de = x[2][12] - x[-2][12]
dde = (x[3][12] - x[-1][12]) - (x[1][12] - x[-3][12])
The last 12 dimensions are the delta delta coefficients:
dd[i] = (x[3][i] - x[-1][i]) - (x[1][i] - x[-3][i]) for i = 0 to 11
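The formulas translate directly into code. The following is a minimal sketch, not the Sphinx implementation: the function name and the list-of-lists representation are made up for illustration, and the function assumes that three frames exist on each side of frame t.

```python
def s3_1x39(x, t):
    """Build the 39-dim vector for frame t from 13-dim cepstral frames.

    x is a list of frames, each a list of 13 floats with C0/energy at
    index 12 (the convention used in this answer).
    Requires 3 <= t < len(x) - 3.
    """
    def delta(i):
        return x[t + 2][i] - x[t - 2][i]

    def delta_delta(i):
        return (x[t + 3][i] - x[t - 1][i]) - (x[t + 1][i] - x[t - 3][i])

    c = [x[t][i] for i in range(12)]                  # cepstra
    d = [delta(i) for i in range(12)]                 # delta cepstra
    e = [x[t][12], delta(12), delta_delta(12)]        # e, de, dde
    dd = [delta_delta(i) for i in range(12)]          # delta delta cepstra
    return c + d + e + dd


# Made-up input: seven frames whose values grow with the frame index.
frames = [[float(f * f + i) for i in range(13)] for f in range(7)]
vector = s3_1x39(frames, 3)
print(len(vector))
```

In the returned vector, positions 0-11 hold c, 12-23 hold d, 24-26 hold e, de and dde, and 27-38 hold dd.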
The above formulae require that, for every frame, the previous 3 frames and the next 3
frames exist. So some special treatment is required to pad 3 frames at the beginning
of an utterance and 3 frames at the end of an utterance. Sphinx 3.x takes care of this by
replicating three frames from the start of the utterance once, so the sequence of frames is
converted from
x[0] x[1] x[2] x[3] x[4] ...
to
x[1] x[2] x[3] x[0] x[1] x[2] x[3] x[4] ...
With this padding, the delta and delta delta of the first frame
become zero: x[2] - x[2] = 0 and (x[3] - x[3]) - (x[1] - x[1]) = 0.
For the second frame, the delta coefficients are zero because x[3] - x[3] is
again zero, while the delta delta coefficients, (x[4] - x[0]) - (x[2] - x[2]), are
generally not. This explains the zeros asked about in Question 3.
This unusual treatment exists partly for legacy reasons and partly out of concern for
performance: it actually did better than padding with zeros at the
beginning of the utterance.
[Note: for our experiments, we could simply ignore these frames if we don't like this.]
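The effect of the start-of-utterance padding on the first two frames can be reproduced on a single cepstral dimension. This is a small sketch with made-up values and hypothetical helper names; it copies frames 1 to 3 in front of the sequence, as in the padded sequence shown in this answer, and ignores the end of the utterance.

```python
def pad_start(track):
    """Prepend copies of frames 1..3 to the sequence."""
    return track[1:4] + track


def delta(p, t):
    return p[t + 2] - p[t - 2]


def delta_delta(p, t):
    return (p[t + 3] - p[t - 1]) - (p[t + 1] - p[t - 3])


# One cepstral dimension over eight frames (arbitrary values).
track = [1.0, 4.0, 2.0, 7.0, 3.0, 8.0, 5.0, 6.0]
padded = pad_start(track)

# Original frame k sits at index k + 3 of the padded sequence.
first = (delta(padded, 3), delta_delta(padded, 3))
second = (delta(padded, 4), delta_delta(padded, 4))
print(first)   # (0.0, 0.0)
print(second)  # (0.0, 2.0)
```

The first frame has zero delta and zero delta delta, while the second frame has zero delta only, whatever values are placed in track.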
6. I sort of guessed the meaning of
param 2165 1 64
mgau 0
feat 0
could you please explain them explicitly?
Answer:
2165 is the number of senones, 1 is the number of streams (to distinguish it from the
Sphinx II feature, which has four streams), and 64 is the number of Gaussian components in a
GMM. You can safely ignore the number of streams; I would describe it as legacy
information.
The number of senones includes both CI senones (165) and CD senones (2000).
In the file labeled means,
mgau 0, feat 0 introduces the means of the Gaussians of the first senone.
In the file labeled variance,
mgau 0, feat 0 introduces the variances of the Gaussians of the first senone.
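Reading the header line is a matter of splitting it into three integers. The sketch below assumes the line has exactly the form "param 2165 1 64" quoted in the question; the function name and dictionary keys are made up for illustration and are not part of any Sphinx tool.

```python
def parse_param_line(line):
    """Split a 'param <senones> <streams> <densities>' line into integers."""
    tag, n_senones, n_streams, n_densities = line.split()
    if tag != "param":
        raise ValueError("not a param line: %r" % line)
    return {
        "n_senones": int(n_senones),      # one GMM (mgau) per senone
        "n_streams": int(n_streams),      # 1 here; legacy information
        "n_densities": int(n_densities),  # Gaussians per GMM
    }


header = parse_param_line("param 2165 1 64")

# The senone count is the sum of the CI and CD senones given in the answer.
N_CI_SENONES, N_CD_SENONES = 165, 2000
total_gaussians = header["n_senones"] * header["n_densities"]
print(header, total_gaussians)
```

With these numbers the model holds 2165 x 64 = 138560 Gaussians in total, each with a 39-dimensional mean and variance.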
7. Just to be sure: the order of the features for 'density' is the same
as in the MFCC vectors, am I right?
Answer:
[I am confused by this question, let us go through another iteration]
8. Is each 'mgau' for a senone? Is it possible to give a file mapping
triphones to senones? If the mapping is done by a tree, would it be
possible to printout the tree in a human readable form? (If this takes
too much time, then forget it)
--------- the weights
Answer:
Yes, each mgau is for one senone. The mapping from triphones to senones can be found in
the model definition file (mdef). You can find it in the corresponding
model_architecture directory.
9. What's the big number in "mixw [0 0] 1.891396e+05"?
10. mixw is a 8x8 matrix. Which is the correct order: 1st row, 2nd
row..., or 1st column, 2nd column?
Answer:
1.891396e+05 is the number of occurrences of senone 0 in training (the second 0 is
the stream index; again, no need to worry about it). It is not an integer because
partial counts are used by the Baum-Welch algorithm in SphinxTrain. The actual mixture
weights are given by the 8x8 matrix, read row by row in this order:
0 1 2 3 4 5 6 7
8 9 10 ...
...
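Reading the matrix row by row gives the component index directly. In the sketch below the helper names are hypothetical, and the normalization step is only an assumption for the case where the printed entries are raw counts rather than weights that already sum to one; the text does not say which.

```python
ROWS, COLS = 8, 8   # 8 x 8 = 64 mixture components


def component_index(row, col):
    """Index of the Gaussian component printed at (row, col), 0-based."""
    return row * COLS + col


def flatten_row_major(matrix):
    """Turn the printed 8x8 block into a list ordered by component index."""
    return [value for row in matrix for value in row]


def normalize(values):
    """Scale non-negative values so that they sum to 1 (assumed step)."""
    total = sum(values)
    return [v / total for v in values]


# Made-up block in which every entry equals its own component index + 1.
block = [[float(r * COLS + c + 1) for c in range(COLS)] for r in range(ROWS)]
weights = normalize(flatten_row_major(block))
print(component_index(1, 0), len(weights))
```

Component k of a senone is then matched with the line "density k" for the same mgau in the means and variance files.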
--------- the variances
11. The first line looks like
density 0 1.146e-01 6.529e-02 4.270e-02 2.791e-02 2.677e-02 1.850e-02
1.773e-02 1.819e-02 1.450e-02 1.538e-02 1.221e-02 1.118e-02 4.417e-02
3.100e-02 2.840e-02 2.683e-02 2.621e-02 2.568e-02 2.562e-02 2.509e-02
2.222e-02 1.967e-02 2.039e-02 1.929e-02 2.556e+00 2.540e-01 4.002e-01
9.349e-02 6.639e-02 5.584e-02 5.552e-02 5.215e-02 4.832e-02 4.748e-02
5.049e-02 4.336e-02 3.828e-02 3.968e-02 4.245e-02
Is the number '1.146e-01' the variance or standard error (sqrt of
variance) of feature dimension 1?
Answer:
It is the variance.
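Since the stored numbers are variances, they enter the Gaussian density without squaring. The sketch below evaluates the log density of one diagonal-covariance Gaussian; it is a generic textbook formula, not the scoring code of the Sphinx decoder, and the function name is made up for illustration.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian.

    x, mean and var are equal-length lists; var holds variances
    (not standard deviations), as in the variance file.
    """
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


# One-dimensional check using the first variance quoted in the question.
score_at_mean = log_gaussian_diag([0.0], [0.0], [1.146e-01])
print(score_at_mean)
```

For a real model, x, mean and var would each have 39 entries, and the 64 component densities of a senone would be combined using its mixture weights.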
-------- some general questions
12. What data was the HMM trained on? How much data and how many
speakers?
Answer:
The models were trained on 80 hours of training data.
[Under construction; I need to look this up.]
13. Do you think we can fully recover all the HMMs from the data you
provided? It seems we only need the triphone->senone mapping, and a
transition matrix -- which I assume is 3-state linear chain with self
transitions. Are the transition probabilities fixed at 0.5, or are they
also trained?
Answer:
The package I have provided so far contains only the test set. The actual training data takes up 10
CDs, so I skipped it for our initial task.
Jerry is right that we only need the triphone-to-senone mapping (this can be
found in the mdef file).
The 3-state HMMs are trained with the Baum-Welch algorithm. The current version of the
decoder we are using (Sphinx 3.5) has the limitation that it only accepts 3- to 5-state HMMs,
though I don't think that is a big problem for us.