ACOUSTIC-PHONETIC SPEECH PARAMETERS FOR SPEAKER-INDEPENDENT SPEECH RECOGNITION

Om Deshmukh, Carol Y. Espy-Wilson and Amit Juneja
University of Maryland, Department of Electrical & Computer Engineering, A. V. Williams Bldg.,
College Park, MD 20742
ABSTRACT
Coping with inter-speaker variability (i.e., differences in the vocal tract characteristics of speakers) is still a major challenge for automatic speech recognizers. In this paper, we discuss a method that compensates for differences in speaker characteristics. In particular, we demonstrate that when a continuous density hidden Markov model (HMM) based system is used as the back end, a Knowledge-Based Front End (KBFE) can outperform the traditional Mel-Frequency Cepstral Coefficients (MFCCs), particularly when there is a mismatch in the gender and ages of the subjects used to train and test the recognizer.
This work was supported by NSF grant # SBR9729688 and NIH grant # 1K02DC00149.
1. INTRODUCTION
A major challenge for present speech recognition systems is tolerance to speaker differences and variations in context and environment. Differences in speaker characteristics are a major source of inter-speaker variation. A considerable amount of research has attempted to deal with one aspect of this variability, the differences in vocal tract length across speakers, by developing Vocal Tract Length Normalization (VTLN) techniques [1,2].
The most popular VTLN technique performs
speaker-specific piecewise linear frequency scaling of
the Mel-Frequency Cepstral Coefficients (MFCCs) [3].
The overall improvement in the word error rate (WER)
obtained with this technique is usually on the order of
0.6% as compared to results obtained without VTLN.
Other methods use an exponential warping scale or a one-Bark shift in the filter bank analysis of male speech relative to female speech [4].
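For illustration, a generic piecewise linear frequency warping function might be sketched as follows (a minimal sketch, not the exact scheme of [3]; the warp factor alpha and the knee fraction are illustrative parameters that are typically estimated per speaker):

def warp_frequency(f, alpha, f_max, knee_frac=0.875):
    """Piecewise linear VTLN warping: scale frequency by alpha up to a knee,
    then follow a second linear segment that maps f_max back to f_max so the
    warped axis stays within the analysis bandwidth."""
    knee = knee_frac * f_max
    if f <= knee:
        return alpha * f
    slope = (f_max - alpha * knee) / (f_max - knee)
    return alpha * knee + slope * (f - knee)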
In this paper, we extend previous work reported in
[5] where we developed a front end that consisted of
acoustic parameters (APs) related to the manner
phonetic features: sonorant, syllabic, continuant, voiced
and strident, in addition to silence. Recognition results
showed that the relative measures used for the APs
were considerably more speaker independent than the
MFCCs. Presently, we have extended the set of APs to
include relative measures that extract information about
some of the place phonetic features. Specifically, we
focused on the place features alveolar and palatal for
strident fricatives and alveolar and labial for stop
consonants. Additionally, APs were developed for the
place features high, low, front and back for vowels, and
rhotic for /r/.
Section 2 discusses the design procedure and the APs used. Section 3 describes the details of the HMM models and other system specifications. Results from the two recognition systems, the AP front end and the MFCC front end, are compared in Section 4. Important conclusions and future work are outlined in Section 5.
2. KBFE
Phonetic components in a speech signal carry the relevant information needed to interpret the linguistic message. To extract this information, we use a signal representation that consists of continuous density parameters that automatically extract the acoustic correlates of the phonetic features. In previous work [6], we showed the importance of using relative measures to reduce the speaker dependency of the APs. In particular, the frequency bands used to characterize the spectral shape of strident fricatives were based on the third formant (F3) estimated from each utterance. Similarly, in the present system, we have incorporated the spectral shape parameters Ahi-A23 and Av-Ahi proposed in [7] for characterizing alveolar and labial stop consonants. We have modified these parameters to make them more speaker-independent through F3 normalization. The APs for the place features high, low, front and back in vowels and rhotic in /r/ were normalized by computing differences between formants in Hertz and Bark.
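For concreteness, this formant-difference normalization can be sketched as follows (a minimal sketch; the paper does not specify which Hz-to-Bark formula was used, so Traunmueller's approximation is assumed here, and the formant values f1, f2, f3 are assumed to be given in Hz):

def hz_to_bark(f_hz):
    # Traunmueller's approximation of the Bark scale
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def vowel_place_aps(f1, f2, f3):
    # Relative measures: inter-formant differences rather than raw formant
    # values, which reduces dependence on a speaker's vocal tract length.
    return {
        "F2-F1_Hz":   f2 - f1,
        "F3-F2_Hz":   f3 - f2,
        "F2-F1_Bark": hz_to_bark(f2) - hz_to_bark(f1),
        "F3-F2_Bark": hz_to_bark(f3) - hz_to_bark(f2),
    }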
Note that at present, KBFE consists of only 28 APs, since we have not yet designed automatic and robust algorithms to extract the acoustic correlates of all of the 20-odd phonetic features. The 28 APs were designed to capture the acoustic correlates of only 13 of the phonetic features (there is some redundancy in the parameters that needs to be removed). Notably, we do not yet have algorithms for the manner features nasal and lateral; the place features alveolar, velar and labial for nasal consonants; the place features labial and dental for the nonstrident fricatives; and the place feature velar for stop consonants.
3. SYSTEM DESCRIPTION
HTK [8], an HMM-based sub-word recognition system, was used for the back-end processing. Thirteen Mel-Frequency Cepstral Coefficients, along with their first and second derivatives, were computed every 5 ms over a window of 25 ms, using 30 channels in the mel filter bank. Zero mean normalization was also incorporated. The APs were likewise computed every 5 ms over a window of 25 ms.
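A minimal sketch of this front-end configuration, assuming the open-source librosa package in place of the original HTK feature extraction (the function name and file handling here are illustrative, not from the paper):

import numpy as np
import librosa

def mfcc_front_end(wav_path):
    y, sr = librosa.load(wav_path, sr=None)   # keep the native sampling rate
    hop = int(0.005 * sr)                     # 5 ms frame shift
    win = int(0.025 * sr)                     # 25 ms analysis window
    # 13 cepstral coefficients from a 30-channel mel filter bank
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=30,
                                hop_length=hop, win_length=win)
    mfcc -= mfcc.mean(axis=1, keepdims=True)  # zero mean normalization
    d1 = librosa.feature.delta(mfcc)          # first derivatives
    d2 = librosa.feature.delta(mfcc, order=2) # second derivatives
    return np.vstack([mfcc, d1, d2])          # 39 parameters per frame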
3.1. Model Topology
Ten HMM models, one for each digit, were developed. Different topologies with various tying schemes were tried, and the one that gave optimal results with the cepstral parameters was selected.
The final topology was as follows: each digit was modeled as a three-state HMM (plus two terminal states) with eight Gaussian mixtures in each state. Each mixture was initialized with zero mean and unit diagonal covariance. Each mixture had a diagonal covariance matrix, and all the mixture weights in all the states were initialized to the same value.
A left-to-right state transition structure with no skips was used, with the additional constraint that each model must start in the first state. All of the allowable transitions were initialized as equiprobable.
The same topology was maintained for both the MFCCs and the KBFE.
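This topology can be sketched with the hmmlearn package as follows (an illustrative stand-in for the HTK model definitions, which are not reproduced in the paper; hmmlearn represents only the three emitting states, with entry and exit handled implicitly):

import numpy as np
from hmmlearn.hmm import GMMHMM

def make_digit_model(n_features=39):
    # 3 emitting states, 8 diagonal-covariance Gaussian mixtures per state;
    # init_params="" keeps the manual initialization below when fit() is called
    m = GMMHMM(n_components=3, n_mix=8, covariance_type="diag",
               init_params="", params="stmcw")
    m.startprob_ = np.array([1.0, 0.0, 0.0])   # must start in the first state
    m.transmat_ = np.array([[0.5, 0.5, 0.0],   # left-to-right, no skips:
                            [0.0, 0.5, 0.5],   # equiprobable self-loop or
                            [0.0, 0.0, 1.0]])  # advance to the next state
    m.weights_ = np.full((3, 8), 1.0 / 8)      # equal mixture weights
    m.means_ = np.zeros((3, 8, n_features))    # zero means
    m.covars_ = np.ones((3, 8, n_features))    # unit diagonal covariances
    return m

# one model per digit, each trained on that digit's utterances
models = {digit: make_digit_model() for digit in "0123456789"}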
3.2. Databases
Recognition experiments were run on isolated digits
from the TIDIGITS [9] and TI46 [10] databases. The
adult speech was taken from TI46 whereas the children
speech was taken from TIDIGITS. The children speech,
with the original sampling rate of 20KHz was
appropriately down sampled to the sampling rate of
TI46 data, 12.5KHz.
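The paper does not state how this conversion was done; one standard method, assumed here, is polyphase resampling by the rational factor 12.5/20 = 5/8:

from scipy.signal import resample_poly

def downsample_20k_to_12k5(y):
    # upsample by 5, downsample by 8, using resample_poly's built-in
    # anti-aliasing filter: 20 kHz * 5/8 = 12.5 kHz
    return resample_poly(y, up=5, down=8)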
The suggested training set of TI46 has 160
utterances of each digit: 10 repetitions by each of the 8
male and 8 female subjects. The testing set has 256
utterances of each digit: 16 repetitions by each of the 8
male and 8 female subjects.
The TIDIGITS training set has 102 utterances of each digit: two repetitions by each of the 25 boys and 26 girls. The testing set has 100 utterances of each digit: two repetitions by each of the 25 boys and 25 girls. The boys were between 6 and 14 years of age, and the girls were between 7 and 15.
4. RESULTS
4.1. Experiments
A series of experiments was carried out to test the robustness of the parameters to gender and age variability. The performance benchmark was set by training and testing on subjects from the same group (adult training with adult testing, and children training with children testing).
To test the effect of age variability, the system was trained on adult data and tested on children's data; in a second experiment, the training and test data were swapped. To test the effect of gender variability, the system was trained on males and tested on females, and trained on boys and tested on girls; both experiments were then repeated with the training and testing subjects swapped.
4.2. Performance
Tables 1-3 show the overall word accuracy results for
the two systems described above.
Table 1: Results for gender variability in adults (in % accuracy). 'Tr' specifies the training data and 'Te' specifies the test data.

                   Tr: Adult    Tr: Males      Tr: Females
                   Te: Adult    Te: Females    Te: Males
MFCCs (39 pars)    99.88        70.27          68.29
KBFE (28 pars)     97.26        91.67          78.33
Table 2: Results for gender variability in children (in % accuracy). 'Tr' specifies the training data and 'Te' specifies the test data.

                   Tr: Children    Tr: Boys     Tr: Girls
                   Te: Children    Te: Girls    Te: Boys
MFCCs (39 pars)    98.30           95.60        95.00
KBFE (28 pars)     94.27           96.56        92.33
Table 3: Results for age variability (in % accuracy). 'Tr' specifies the training data and 'Te' specifies the test data.

                   Tr: Adult    Tr: Child    Tr: Adult    Tr: Child
                   Te: Adult    Te: Child    Te: Child    Te: Adult
MFCCs (39 pars)    99.88        98.30        60.20        62.37
KBFE (28 pars)     97.26        94.71        85.15        85.88
4.3. Discussion
The first column of Tables 1-3 shows that, at present, KBFE does not perform quite as well as the MFCCs. Presumably, this slightly lower performance is due to the fact that KBFE does not yet constitute a full representation of the speech signal. As stated above, KBFE consists of only 28 APs. APs for some phonetic features needed for digit recognition, e.g., the feature nasal and the place features labial and dental for weak fricatives, are missing. Analyses of the recognition results show that the major sources of error are the confusions between “one” and “nine” (8.3% substitutions) and between “five” and “nine” (2.4% substitutions). Thus, we feel strongly that the performance of KBFE with HTK will improve greatly once APs have been added for the missing phonetic features.
The other columns in Tables 1-3 (with the
exception of Table 2) show that KBFE outperforms the
MFCCs when the test data is very different from the
training data. Relative to the results obtained with the
MFCCs, the results with KBFE show an error reduction
of 39% when there is a gender difference between the
subjects used to train and test the system (Table 1).
Similarly, KBFE shows an error reduction of about
63% when there is a considerable age difference
between the subjects used to train and test the
recognizer (Table 3).
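Here, error reduction is measured as the relative drop in word error rate. For example, for the adult-trained, child-tested condition in Table 3, the word error rates are 100 - 60.20 = 39.80% (MFCCs) and 100 - 85.15 = 14.85% (KBFE), so the relative error reduction is (39.80 - 14.85) / 39.80 ≈ 63%.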
Note that there is very little change in the results of
Table 2. Such consistent results suggest that boys and
girls between 7 and 15 years of age have very similar
vocal tract lengths.
5. CONCLUSION & FUTURE WORK
In this paper, we have shown that the APs, which are
based on phonetic features, target the linguistic
information in the speech signal. Further, we have
shown that the use of relative measures can greatly reduce the effect of differences in speaker characteristics. Work is in progress to complete the development of KBFE. Once it is complete, we expect a considerable improvement in accuracy and even more consistent results across different speaker groups.
REFERENCES
[1] P. Zhan and A. Waibel, “Vocal tract length normalization for large vocabulary continuous speech recognition”, Technical Report CMU-CS-97-148, Carnegie Mellon University, May 1997.
[2] P. Zhan and M. Westphal, “Speaker normalization based on frequency warping”, Proc. ICASSP 1997, Munich, 1997.
[3] T. Hain et al., “The 1998 HTK system for transcription of conversational telephone speech”, Proc. ICASSP 1999, Phoenix, pp. 57-60, 1999.
[4] Y. Ono et al., “Speaker normalization using constrained spectra shifts in auditory filter domain”, Proc. Eurospeech 1993, pp. 355-358, 1993.
[5] N. Bitar and C. Espy-Wilson, “Knowledge-based parameters for HMM speech recognition”, Proc. ICASSP 1996, Vol. 1, pp. 29-32, May 1996.
[6] N. Bitar and C. Espy-Wilson, “The design of acoustic parameters for speaker-independent speech recognition”, Proc. Eurospeech 1997, Vol. 3, pp. 1239-1242, 1997.
[7] K. N. Stevens and S. Y. Manuel, “Revisiting place of articulation measures for stop consonants: Implications for models of consonant production”, Proc. XIV International Congress of Phonetic Sciences, Vol. 2, pp. 1117-1120.
[8] HMM Tool Kit (HTK), version 3.0, http://htk.eng.cam.ac.uk/
[9] R. G. Leonard and G. Doddington, TIDIGITS, Linguistic Data Consortium, http://www.ldc.upenn.edu/
[10] M. Liberman, TI 46-Word, Linguistic Data Consortium, http://www.ldc.upenn.edu/