These are the things I thought of that might make the dataset more useful for non-speech people:

--------------------------
MFCC features:

1. A short, general description of the data, including: the (approximate) number of speakers, the rough gender distribution, what the Communicator task is, and what kind of text was read (if you have the reference text, it would be nice to include that too).

Answer: To quote the Communicator web page: "The CMU Communicator project explores advanced dialog management architectures for complex problem solving tasks. It is a project under the Carnegie Mellon Sphinx Group and is funded by the DARPA Communicator program." Testers of the system were allowed to ask it for flight information using unconstrained language. The package distributes only the test sets. [Arthur: I currently don't have information about the speaker and gender distributions. I will try to look them up in the literature.]

2. How long a time window does each 39-dim vector cover? Do windows overlap?

Answer: Here is some basic information about the front end we used.

    Sampling rate: 8000 Hz
    Cepstral mean normalization: yes
    Frame rate: 100 frames/second
    Window length: 0.0256 s
    Lower filter: 200 Hz
    Upper filter: 3500 Hz
    Number of filters: 31
    Number of MFCCs: 13

So, to answer the question: the window size is 0.0256 s, and the windows overlap, advancing at 100 frames/second. That means the frame shift is 1/100 = 0.01 s, so consecutive windows overlap by 0.0156 s. [The frame shift still needs to be confirmed, though I don't think it is crucial.]

3. What are the 39 features? I find that
    the first vector is 13 something, 26 zeros;
    the second vector is 13 something, 13 zeros, 13 something,
which makes me wonder whether the order is MFCC, deltadelta, delta?

Answer: This is an artifact of the dynamic feature computation. I will explain it under Question 5.

4. Within the 13-dim MFCC, is the order from low frequency (1st feature) to high frequency (13th feature), or the other way around? Is there an energy dimension?
Answer: I think it would be useful to describe briefly what the front end does at this point. Each frame is processed as follows:

    1) A pre-emphasis filter is applied to the sample points.
    2) A Hamming window is applied to the frame.
    3) The FFT of the frame is computed, giving a spectrum vector.
    4) The spectrum vector is passed through a bank of filters. In our case the lowest filter is at 200 Hz, the highest is at 3500 Hz, and there are 31 filters, each triangular in shape. The magnitude is obtained for each filter, so at this point we have a vector of length 31.
    5) This vector is then transformed by a discrete cosine transform.

Because of this last step, it is hard to interpret the individual features in terms of frequency. [Arthur: I find it quite difficult to understand DCT-transformed vectors intuitively myself. I can only say that they are a smoothed version of the spectral coefficients. This part may need to be expanded in the future; I may want to include formulas at some point.]

5. How do you compute delta and deltadelta exactly (please give formulae like x[t+1]-x[t-1])?

----------------------------
Gaussians:

-------- the mean

Answer (to Question 5): We use a dynamic feature type called s3_1x39. The exact computation is as follows. The cepstral vector has 13 dimensions. The first 12 are the actual cepstral coefficients (x[0][0..11]); the last one (x[0][12]) can be C0 or energy. The first index, 0, denotes the current frame; we use this notation because the computation requires previous and future frames. A string such as s3_1x39 names one particular way of arranging the feature vector together with its dynamic coefficients. (There are many such arrangements; another is 1s_c_d_dd.)
s3_1x39 arranges the 39-dimensional vector in four parts:

    12 dimensions of cepstral coefficients (c),
    12 dimensions of delta coefficients (d),
    3 dimensions of energy, delta energy and delta-delta energy (e, de, dde),
    12 dimensions of delta-delta coefficients (dd).

The first 12 dimensions are the 12 cepstral coefficients of the current frame:

    c[i] = x[0][i]    for i = 0 to 11

The next 12 dimensions are the delta coefficients:

    d[i] = x[2][i] - x[-2][i]    for i = 0 to 11

The next 3 dimensions are energy, delta energy and delta-delta energy:

    e   = x[0][12]
    de  = x[2][12] - x[-2][12]
    dde = (x[3][12] - x[-1][12]) - (x[1][12] - x[-3][12])

The last 12 dimensions are the delta-delta coefficients:

    dd[i] = (x[3][i] - x[-1][i]) - (x[1][i] - x[-3][i])    for i = 0 to 11

These formulae require that, for every frame, the previous 3 frames and the next 3 frames exist. So some special treatment is needed to pad 3 frames at the beginning of an utterance and 3 frames at the end. Sphinx 3.x handles the beginning by replicating three frames from the start of the utterance once and placing the copies in front. Writing x[k] here for the k-th frame of the utterance, the sequence of frames is converted from

    x[0] x[1] x[2] x[3] x[4] ...

to

    x[1] x[2] x[3] x[0] x[1] x[2] x[3] x[4] ...

Done this way, the delta and delta-delta of the first frame become zero (x[2] - x[2] = 0, and (x[3] - x[3]) - (x[1] - x[1]) = 0). For the second frame, the delta coefficients are again zero because x[3] - x[3] = 0. This rather curious treatment is partly a legacy and partly motivated by performance: it actually did better than padding with zeros at the beginning of the utterance. [Note: for our experiment, we could just ignore these frames if we don't like them.]

6. I sort of guessed the meaning of

    param 2165 1 64 mgau 0 feat 0

Could you please explain them explicitly?
Answer: 2165 is the number of senones, 1 is the number of streams (as opposed to the Sphinx II feature, which has four streams), and 64 is the number of Gaussian components in a GMM. You can safely ignore the number of streams; I would describe it as legacy information. The number of senones includes both CI senones (165) and CD senones (2000).

In the file labeled means, "mgau 0, feat 0" introduces the means for the first Gaussian mixture. In the file labeled variance, "mgau 0, feat 0" introduces the variances for the first Gaussian mixture.

7. Just to be sure: the order of the features for 'density' is the same as in the MFCC vectors, am I right?

Answer: [I am confused by this question; let us go through another iteration.]

8. Is each 'mgau' for a senone? Is it possible to give a file mapping triphones to senones? If the mapping is done by a tree, would it be possible to print out the tree in a human-readable form? (If this takes too much time, then forget it.)

Answer: Yes, each mgau is for one senone. The mapping from triphones to senones can be found in the model definition file (mdef), which is in the corresponding model_architecture directory.

--------- the weights

9. What's the big number in "mixw [0 0] 1.891396e+05"?

10. mixw is an 8x8 matrix. Which is the correct order: 1st row, 2nd row, ..., or 1st column, 2nd column, ...?

Answer: 1.891396e+05 is the number of occurrences of senone 0 in training (the second 0 is the stream index; again, no need to worry about it). It is not an integer because partial counts are used in the Baum-Welch algorithm of SphinxTrain. The actual mixture weights are given by the 8x8 matrix, whose entries are ordered row by row:

     0  1  2  3  4  5  6  7
     8  9 10 ...
     ...

--------- the variances

11.
The first line looks like

    density 0
        1.146e-01 6.529e-02 4.270e-02 2.791e-02 2.677e-02 1.850e-02
        1.773e-02 1.819e-02 1.450e-02 1.538e-02 1.221e-02 1.118e-02
        4.417e-02 3.100e-02 2.840e-02 2.683e-02 2.621e-02 2.568e-02
        2.562e-02 2.509e-02 2.222e-02 1.967e-02 2.039e-02 1.929e-02
        2.556e+00 2.540e-01 4.002e-01 9.349e-02 6.639e-02 5.584e-02
        5.552e-02 5.215e-02 4.832e-02 4.748e-02 5.049e-02 4.336e-02
        3.828e-02 3.968e-02 4.245e-02

Is the number '1.146e-01' the variance or the standard deviation (sqrt of variance) of feature dimension 1?

Answer: It is the variance.

-------- some general questions

12. On what data was the HMM trained? How much data and how many speakers?

Answer: The models were trained using 80 hours of training data. [Under construction; I need to look it up.]

13. Do you think we can fully recover all the HMMs from the data you provided? It seems we only need the triphone->senone mapping and a transition matrix, which I assume is a 3-state linear chain with self-transitions. Are the transition probabilities fixed at 0.5, or are they also trained?

Answer: The package I have provided so far contains just the test set. The actual training data takes 10 CDs to store, so I skipped it for our initial task. Jerry is right that we only need the triphone-to-senone mapping (this can be found in the mdef file). The 3-state HMM is trained with the Baum-Welch algorithm. The current version of the decoder we are using (Sphinx 3.5) has the limitation that it accepts only 3- to 5-state HMMs, though I don't see that as a big problem for us.
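
As a cross-check on the answer to Question 5, here is a minimal Python sketch of the s3_1x39 computation described there. It is not taken from the Sphinx source: the function name s3_1x39 and the use of NumPy are choices made for illustration only. The padding at the beginning follows the answer (copies of x[1] x[2] x[3] placed in front of x[0]). The answer does not say how the end of the utterance is padded, so the mirrored padding used at the end is an assumption.

```python
import numpy as np


def s3_1x39(cep):
    """Illustrative sketch of the 39-dim layout described under Question 5.

    cep: array of shape (T, 13); columns 0..11 are cepstral coefficients,
         column 12 is C0 or energy.
    Returns an array of shape (T, 39) laid out as
         c (12), d (12), e/de/dde (3), dd (12).
    """
    cep = np.asarray(cep, dtype=float)
    n_frames = cep.shape[0]
    if cep.ndim != 2 or cep.shape[1] != 13:
        raise ValueError("expected an array of shape (T, 13)")
    if n_frames < 4:
        raise ValueError("need at least 4 frames to build the padding")

    # Beginning of utterance, as described in the answer:
    # x[1] x[2] x[3] are copied in front of x[0].
    head = cep[1:4]
    # End of utterance: ASSUMPTION (not specified in the answer).
    # Mirror the treatment of the beginning by appending copies of the
    # three frames that precede the last frame.
    tail = cep[n_frames - 4:n_frames - 1]
    x = np.vstack([head, cep, tail])

    # Positions of the real frames inside the padded sequence.
    t = np.arange(3, 3 + n_frames)

    # d = x[+2] - x[-2]
    delta = x[t + 2] - x[t - 2]
    # dd = (x[+3] - x[-1]) - (x[+1] - x[-3])
    ddelta = (x[t + 3] - x[t - 1]) - (x[t + 1] - x[t - 3])

    return np.hstack([
        cep[:, :12],       # c
        delta[:, :12],     # d
        cep[:, 12:13],     # e
        delta[:, 12:13],   # de
        ddelta[:, 12:13],  # dde
        ddelta[:, :12],    # dd
    ])
```

With this padding, the first output row has zeros in all delta and delta-delta positions, and the second row has zeros in the delta positions, as the answer to Question 5 describes.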