IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 8, NOVEMBER 2007
Unsupervised Discriminative Training With
Application to Dialect Classification
Rongqing Huang, Member, IEEE, and John H. L. Hansen, Fellow, IEEE
Abstract—Automatic dialect classification has gained interest
in the field of speech research because of its importance in characterizing speaker traits and estimating knowledge that could
improve integrated speech technology (e.g., speech recognition,
speaker recognition). This study addresses novel advances in unsupervised spontaneous dialect classification in English and Spanish.
The problem considers the case where no transcripts are available
for training and test data, and speakers are talking spontaneously.
The Gaussian mixture model (GMM) is used for unsupervised
dialect classification in our study. Techniques which aim to deal
with confused acoustic regions in the GMMs are proposed, where
confused regions in the GMMs are identified through data-driven
methods. The first technique excludes confused regions by finding
dialect dependence in the untranscribed audio by selecting the
most discriminative Gaussian mixtures [mixture selection (MS)].
The second technique includes the confused regions in the model,
but the confused regions are balanced over all classes. This technique is implemented by identifying discriminative frames and
confused frames in the audio data [frame selection (FS)]. The
new confused regions contribute to model representation but do not impact classification performance. The third technique is to
reduce the confused regions in the original model. Minimum classification error (MCE) is applied to achieve this objective. All three
techniques implement discriminative training for GMM-based
classification. Both the first technique (MS-GMM, GMM trained
with mixture selection) and the second technique (FS-GMM,
GMM trained with frame selection) improve dialect classification
performance. Further improvement is achieved after applying
the third technique (MCE training) before the first or second
techniques. The system is evaluated using British English dialects
and Latin American Spanish dialects. Measurable improvement
is achieved in both corpora. Finally, the system is compared with
human listener performance, and shown to outperform human
listeners in terms of classification accuracy.
Index Terms—Accent and dialect, accent classification, automatic dialect classification, discriminative training, English
dialects, frame selection Gaussian mixture model (FS-GMM),
Gaussian mixture selection, minimum classification error (MCE),
robust speech recognition, Spanish dialects.
Manuscript received November 15, 2006; revised June 6, 2007. This work
was supported by the U.S. Air Force Research Laboratory, Rome, NY, under
Contract FA8750-04-1-0058 and RADC under Contract A40104. Any opinions,
findings and conclusions expressed in this material are those of the authors and
do not necessarily reflect the views of the U.S. Air Force. The associate editor
coordinating the review of this manuscript and approving it for publication was
Dr. Timothy J. Hazen.
R. Huang was with the Center for Robust Speech Systems (CRSS), Department of Electrical Engineering, University of Texas at Dallas, Richardson, TX
75083-0688 USA. He is now with Nuance Communications, Burlington, MA
01803 USA.
J. H. L. Hansen is with the Center for Robust Speech Systems (CRSS), Department of Electrical Engineering, University of Texas at Dallas, Richardson,
TX 75083-0688 USA (e-mail: john.hansen@utdallas.edu).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASL.2007.903302
I. INTRODUCTION
THIS study employs the following definition for the term
accent: “The cumulative auditory effect of those features
of pronunciation which identify where a person is from regionally and socially. The linguistic literature emphasizes that the
term refers to pronunciation only, and is thus distinct from dialect, which refers to grammar and vocabulary as well” [5].
In our study, we consider dialect/accent to be a pattern of pronunciation and/or vocabulary of a language used by the community of native/non-native speakers belonging to some geographical region. For example, American English and British English are two dialects of English; English spoken by native speakers of Chinese or German represents two accents of English. Some researchers have a
slightly different definition of dialect and accent, depending on
whether they approach the problem from a linguistics or speech
science/engineering perspective. In our study, we will use “dialect” and “accent” interchangeably, since the formulated algorithms in this study can be applied to both dialect and accent detection. Automatic dialect classification is important for
characterizing speaker traits, as well as estimating knowledge
that could be used to improve speech system performance (e.g.,
speech recognition, speech coding, speaker recognition). Dialect/accent is one of the most important factors next to gender
that influence automatic speech recognition (ASR) performance
[10], [11]. Dialect knowledge could be used in various components of an ASR system such as pronunciation modeling [20],
lexicon adaptation [27], and acoustic model training [14] and
adaptation [7]. Dialect knowledge could be directly applied in
automatic call centers and directory lookup services [31]. Effective methods for accent modeling and detection have also been
developed, which can also contribute to improved speech systems [32].
Our efforts for dialect identification focus on classifying
unconstrained audio, which represents unknown gender, unknown (new) speaker, and unknown text. If transcripts exist
for the associated training audio, we have previously proposed
a word-based dialect classification (WDC) algorithm which
turns the text-independent dialect classification problem into a
text-dependent dialect classification problem [13] and achieves
very high classification accuracy. If the training data size is
too small to train the word-specific models, a context-adaptive
training (CAT) algorithm was proposed to address this problem
and also achieves high classification accuracy [12]. If there
are no available transcripts for the training data, the above
algorithms cannot be applied, and therefore an unsupervised
algorithm must be formulated.1 The Gaussian Mixture Model
(GMM)-based classifier has been applied to unsupervised
dialect classification [26] and text-independent speaker recognition [25] successfully. In our study, the GMM-based classifier
is used for unsupervised dialect classification, with the focus
on how to formulate effective discriminative GMM training to
achieve improved performance.
In our study, three dialects from each of two languages are considered: British English dialects from Cambridge,
Belfast, and Cardiff; and Latin American Spanish dialects
from Cuba, Peru, and Puerto Rico. The patterns of difference among dialects may vary from one language to another. In many languages, however, dialect differences arise from particular phonemes occurring in certain positions within words or phrases. In English dialects, a rhotic after the vowel (e.g., farm: /F AA R M/ vs. /F AA M/) and syllable boundary placement (e.g., self#ish vs. sel#fish) are dialect differences visible at the word level [29]. Spanish dialect differences likewise concentrate on particular phonemes at certain positions [2], [19]. For example, syllable-final /s/ is dropped in Cuban Spanish and reinforced in Peruvian Spanish. The GMM classifier trained and tested on human
labels that include only this information can achieve 98% accuracy on 20-s audio files [30]. While dialect-dependent information is represented by certain acoustic events, it follows that other acoustic events show no difference among dialects, and these might distract or even confuse dialect
classification. Since there are no transcripts available for the
audio training data, an automatic way to identify the dialect-dependent and dialect-confusing acoustic events is valuable. If
the acoustic events are represented by Gaussian mixtures in the
model space, the identification task can be implemented in the
model space as well. In this study, we propose a data-driven
method to identify the confused regions in the model space.
Next, three distinct techniques are proposed to deal with the
confused regions in the model space. The first technique is
to exclude the confused regions in the model, and the second
technique is to include the confused region which is normalized
so that it could map the acoustic events but not impact the
classification decision. The third technique is based on the
well-known minimum classification error (MCE) criterion,
which essentially reduces the confused regions in the model.
A very interesting observation is that the first or the second
technique can be applied after the third technique.
The remainder of this paper is organized as follows. Section II
introduces the GMM-based classification system. Section III
is dedicated to the three discriminative techniques for GMM
training. The first technique—mixture selection-based GMM
training (MS-GMM) is proposed in Section III-A; the second
technique—frame selection-based GMM training (FS-GMM)
is proposed in Section III-B; the third technique—MCE-based
GMM training is reviewed in Section III-C. The idea of combining the first or second technique with the third technique is
1The dialect label for the training data is always known. Please note the term
“unsupervised training” in this study means transcription-free training. This is
different from the traditional definition of “unsupervised training” in the pattern
recognition community.
Fig. 1. Naive Bayes model. $c$ is the class label; $x_i$ is the $i$th observation vector.
Fig. 2. Baseline GMM-based unsupervised training system.
presented in Section III-D. Experimental evaluation of the two
corpora are presented in Section IV, which includes a comparison between human and machine dialect classification in
Section IV-D. Finally, a summary and conclusions are presented
in Section V.
II. GMM-BASED CLASSIFICATION ALGORITHM
Since no transcripts are available for training and test data,
it is difficult to build a supervised generative model such
as an HMM. The simplest method for unsupervised speech
related classification is to apply a naive Bayes classifier [8]
(see Fig. 1). In naive Bayes, each observation vector is generated independently. The sequence information of observation
vectors is omitted. Typical naive Bayes models in speech processing are vector quantization (VQ) [9] and the Gaussian
mixture model (GMM) [24]. VQ can be considered as hard
decision modeling and the GMM can be considered as soft
decision modeling with an underlying probability distribution.
The GMM classifier is a popular method for text-independent
speaker recognition [25] and dialect classification [26]. We use
the GMM classifier as our baseline system. Fig. 2 shows the
block diagram of the baseline GMM training system, where $D$ is the number of predefined dialects. The GMM for dialect $i$ is trained with data from dialect $i$. The training method
is a generalized maximum likelihood estimation (MLE)—the
expectation maximization (EM) algorithm [6], [24]. In our
study, the GMM trained with MLE (i.e., MLE-GMM) is applied to the baseline system. It also serves as the initial model
for the discriminative training presented in Section III. A
GMM-based gender classifier is trained similarly and is applied
prior to dialect classification. Fig. 3 shows the block diagram of
the unsupervised GMM-based dialect classification system. We
will describe the silence remover and feature extraction steps
in the experimental section.
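To make the baseline concrete, the following Python sketch shows one way such a system could be assembled: one diagonal-covariance GMM per dialect trained with EM, and a naive Bayes decision that sums per-frame log-likelihoods. The function names and the scikit-learn dependency are our own illustrative choices, not the authors' implementation.

```python
# Sketch of the baseline MLE-GMM dialect classifier (illustrative, not the
# authors' code). Assumes per-dialect feature matrices (n_frames x n_dims)
# have already been extracted; dialect labels are known, transcripts are not.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_dialect_gmms(train_feats, n_mix=600, seed=0):
    """train_feats: dict mapping dialect name -> (n_frames, n_dims) array."""
    gmms = {}
    for dialect, X in train_feats.items():
        # Diagonal covariances; EM implements the generalized MLE training.
        g = GaussianMixture(n_components=n_mix, covariance_type="diag",
                            max_iter=50, random_state=seed)
        gmms[dialect] = g.fit(X)
    return gmms

def classify_utterance(gmms, X):
    """Naive Bayes decision: frames are scored independently (sequence
    information is omitted) and log-likelihoods are summed per dialect."""
    scores = {d: g.score_samples(X).sum() for d, g in gmms.items()}
    return max(scores, key=scores.get)
```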
There have been several successful techniques for unsupervised language identification, which might be applicable to
unsupervised dialect classification. For example, a number of
popular methods are based on phone recognition such as single
language Phone Recognition followed by language-dependent
language modeling (PRLM), parallel PRLM, and language-dependent parallel phone recognition (PPR) [16], [31]. The basic idea is to analyze the phoneme sequence of the audio. The phoneme sequence is obtained through one or a parallel set of phoneme recognizers trained with outside data. Differences among the phoneme sequences are then captured through n-gram language models. The focus of our study is to develop better estimation algorithms than MLE for unsupervised dialect classification. Therefore, the GMM trained with the MLE algorithm is considered the baseline system in our study.

Fig. 3. GMM-based unsupervised dialect classification system.
III. DISCRIMINATIVE TRAINING FOR UNSUPERVISED
DIALECT CLASSIFICATION
As discussed in the introduction, some acoustic events correspond to dialect-dependent information, while other acoustic
events might distract or even confuse the dialect classification
procedure. Here, we view the Gaussian mixtures in the GMM
as corresponding to the acoustic events in the audio as suggested in [25]. We assume that some mixtures correspond to
acoustic events which contribute to dialect classification, and
other mixtures correspond to acoustic events which distract
the model from dialect classification. Fig. 4(a) represents the
GMMs trained using MLE method as discussed in Section II.
The overlapped regions of the two GMMs represent the regions
(Gaussian mixtures) which distract the model from dialect
classification. The important topic is how to deal with the
confused regions in the models, which will be addressed with
several techniques below.
The first technique is to remove the confused region so that
the model includes only dialect-dependent (discriminative) mixtures [see Fig. 4(b)]. This technique is called Mixture Selection
(MS), which is proposed in Section III-A. The second technique
is to average out the confused region, so that the confused region maps the confused acoustic events, but will not impact the
classification procedure [see Fig. 4(c)]. This technique is implemented through frame selection (FS) in the data space, which is
proposed in Section III-B. A common way to deal with the confused region in the models is to separate the models as much as
possible from each other so that the confused region can be reduced [see Fig. 4(d)]. The well-known MCE [17], [18] is such
a technique and is reviewed in Section III-C. Another interesting
discriminative training algorithm for GMM is maximum mutual information (MMI) estimation [22], [23], which has been
applied to language identification successfully [1], [21]. MMI
training can also reduce the confused region as MCE training does,
and therefore it can be applied here as well.
Fig. 4. Relationship among MLE-GMM training, MS-GMM training,
FS-GMM training, and MCE-GMM training. (a) MLE-GMM. (b) MS-GMM.
(c) FS-GMM. (d) MCE-GMM. (e) MCE-MS-GMM. (f) MCE-FS-GMM.
Fig. 5. Discriminative GMM training based on Gaussian mixture selection
(MS-GMM).
A. Gaussian Mixture Selection on GMM Training (MS-GMM)
The Gaussian mixtures represent the acoustic space of the
training data. We expect some Gaussian mixtures will represent
dialect-dependent acoustic characteristics, and others to be less
dialect-dependent and could in fact cause confusion for dialect
classification. In this section, we formulate a scheme to detect
the most dialect-dependent Gaussian mixtures and sort the mixtures according to their discriminating abilities. The new GMM
is then obtained by selecting the top discriminative mixtures.
Fig. 5 shows the block diagram of Gaussian mixture selection
for GMM training. The starting GMMs are the baseline GMMs
trained using MLE, as shown in Section II.
Let $\lambda_i = \{w_{ik},\, \mathcal{N}(\mu_{ik}, \Sigma_{ik})\}_{k=1}^{M}$ be the GMM of dialect $i$, and $\lambda_{ik}$ be the $k$th Gaussian component of $\lambda_i$, where $1 \le i \le D$, $1 \le k \le M$, $D$ is the number of dialects, and $M$ is the number of mixtures for each dialect GMM. While the number of mixtures can differ, we use the same number of mixtures for each dialect GMM in our study. For the $k$th Gaussian component in the $i$th GMM $\lambda_{ik}$, $w_{ik}$ is the Gaussian component weight, and $\mathcal{N}(\mu_{ik}, \Sigma_{ik})$ is the corresponding Gaussian distribution with mean vector $\mu_{ik}$ and associated diagonal covariance matrix $\Sigma_{ik}$, over $d$-dimensional feature vectors. The number of speech frames in the training data for dialect $i$ is $T_i$, and the total number of speech frames in the training data is $T = \sum_{i=1}^{D} T_i$. The discriminating ability of the Gaussian component $\lambda_{ik}$ is defined as

$$DA(\lambda_{ik}) = \sum_{t=1}^{T} \alpha_t(i,k)\, p_t(i,k) \tag{1}$$

where $p_t(i,k)$ is the weighted density of Gaussian component $\lambda_{ik}$ generating speech frame $x_t$, which is defined as

$$p_t(i,k) = w_{ik}\, \mathcal{N}(x_t;\, \mu_{ik}, \Sigma_{ik}). \tag{2}$$

Here, the weight term $\alpha_t(i,k)$ in (1) is defined as

$$\alpha_t(i,k) = \begin{cases} +1, & \text{if } (i^*, k^*) = (i, k) \text{ and } x_t \in \text{class } i\\ -1, & \text{if } (i^*, k^*) = (i, k) \text{ and } x_t \notin \text{class } i\\ 0, & \text{otherwise} \end{cases} \tag{3}$$

where $(i^*, k^*)$ indexes the top-scoring weighted component for frame $x_t$ across all dialect GMMs

$$(i^*, k^*) = \arg\max_{1 \le j \le D,\; 1 \le m \le M} p_t(j, m). \tag{4}$$
The larger the value of $DA(\lambda_{ik})$, the larger the discriminating ability of the $k$th Gaussian component in the $i$th dialect GMM. For each GMM, the mixtures are sorted based on this discriminating ability measure. The new GMM is formulated by selecting the top discriminative mixtures in the prior GMM, and the weights are recalculated in order to ensure $\sum_{k} w_{ik} = 1$ in the new GMM. As shown in Fig. 4(b), the confused region in the acoustic model space is therefore excluded. The evaluation process is performed exactly as in the baseline GMM classification system. In our study, we formulate four variations of the above scheme. If we remove the probability term $p_t(i,k)$ from the right-hand side of (1), $DA(\lambda_{ik})$ is actually the discriminative speech frame count for the $k$th Gaussian component in the $i$th dialect GMM. We refer to the original $DA(\lambda_{ik})$ as the probability score. The frame count and probability score calculated above are therefore raw values. If we remove the negative ($-1$) term in (3), the calculated frame count and probability score are referred to as the normalized values; in this case, $DA(\lambda_{ik})$ is always non-negative.
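To make the selection procedure concrete, the following hedged Python sketch implements one reading of (1)-(4) with the normalized probability score: each frame credits the single top-scoring weighted component across all dialect GMMs when that component belongs to the frame's own dialect, and the top-scoring fraction of components is then kept with renormalized weights. The helper names, the exact credit rule, and the parameter layout are our assumptions.

```python
# Hedged sketch of mixture selection (MS) with the normalized probability
# score (NPS); a reconstruction, not the authors' code.
import numpy as np

def weighted_log_dens(weights, means, covs, X):
    """log[w_k N(x_t; mu_k, Sigma_k)] for a diagonal-covariance GMM.
    weights: (M,), means: (M, d), covs: (M, d) diagonal terms, X: (T, d)."""
    out = np.empty((X.shape[0], len(weights)))
    for k in range(len(weights)):
        diff = X - means[k]
        out[:, k] = (np.log(weights[k])
                     - 0.5 * (np.sum(diff ** 2 / covs[k], axis=1)
                              + np.sum(np.log(2 * np.pi * covs[k]))))
    return out

def mixture_selection(params, train_feats, keep_frac=0.75):
    """params: dict dialect -> (weights, means, covs); train_feats: dict
    dialect -> (T_i, d) frames. Returns pruned, weight-renormalized GMMs."""
    dialects = list(params)
    da = {d: np.zeros(len(params[d][0])) for d in dialects}
    for src in dialects:                                  # frames of dialect src
        X = train_feats[src]
        logp = {d: weighted_log_dens(*params[d], X) for d in dialects}
        stacked = np.hstack([logp[d] for d in dialects])  # (T, D*M)
        M = logp[dialects[0]].shape[1]
        top = stacked.argmax(axis=1)                      # (i*, k*) per frame, cf. (4)
        for t, idx in enumerate(top):
            d_star, k_star = dialects[idx // M], idx % M
            if d_star == src:                             # NPS: credit in-class wins only
                da[d_star][k_star] += np.exp(logp[d_star][t, k_star])
    pruned = {}
    for d in dialects:
        w, mu, cov = params[d]
        keep = np.argsort(da[d])[::-1][: int(keep_frac * len(w))]
        pruned[d] = (w[keep] / w[keep].sum(), mu[keep], cov[keep])  # renormalize
    return pruned
```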
In summary, the Gaussian mixture discriminating ability is measured using four different scores: the raw frame count (RFC), the normalized frame count (NFC), the raw probability score (RPS), and the normalized probability score (NPS). These four scores represent methods of calculating the discriminating ability; they are not scale factors on the calculation. Therefore, the number of mixtures in a GMM has little impact on the choice of measurement. In Section IV-B, we will describe how we pick the measurement for MS-GMM.

Fig. 6. Discriminative GMM training based on frame selection (FS-GMM).
B. Frame Selection on GMM Training (FS-GMM)
In addition to extracting the dialect-dependent information
in the acoustic space represented by the Gaussian mixtures, we
can extract the dialect-dependent discriminative information
from the training data directly; that is, we can remove the most confusing speech frames in the training data and train new GMMs with the remaining data (this algorithm is termed FS-GMM). A speech frame $x_t$ from dialect $i$ is called a confusing frame if

$$i^* = \arg\max_{1 \le j \le D} \log p(x_t \mid \lambda_j) \quad \text{and} \quad i^* \ne i \tag{5}$$

i.e., if the top-scoring dialect model for the frame is not the model of the frame's own dialect.
When the number of consecutive confusing frames over time is
greater than a predefined threshold (0.1 s in our study, which
represents ten acoustic frames), we believe these frames to be
“garbage” frames. After removing the garbage frames, a new
GMM (i.e., discriminative GMM) is trained using the remaining
(i.e., discriminative) speech frames in that dialect. The discriminative GMM is trained for each dialect. The garbage frames
from all dialects are grouped together and a garbage GMM is
trained. The final GMM system is obtained by combining the
discriminative GMM with the garbage GMM. The prior probability for model combination determines how much weight is assigned to the discriminative GMM versus the garbage GMM; it can only be determined empirically. In the experimental section, the performance
of FS-GMM based on different prior probabilities will be presented. Fig. 6 shows the block diagram of the GMM training
based on frame selection.
In the combined new model, the Gaussian mixtures from the
garbage model will neither help discriminate the dialect classes,
nor distract the classification, but they will help map the
confused acoustic events. The dialect-dependent acoustic events
are mapped to the Gaussian mixtures from the discriminative
model. As shown in Fig. 4(c), the confused regions in the models are therefore balanced and included.
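A minimal sketch of the frame-selection logic follows, under our reading of (5): a frame is confusing when its own dialect's model is not the top scorer; runs of at least ten consecutive confusing frames (0.1 s at 100 frames/s) become garbage frames; and test scoring mixes the discriminative and garbage models with a prior. The helper names and the frame-level mixing rule are illustrative assumptions.

```python
# Hedged sketch of frame selection (FS) and combined scoring; illustrative only.
import numpy as np
from sklearn.mixture import GaussianMixture

def garbage_mask(gmms, X, own_dialect, min_run=10):
    """gmms: dict dialect -> fitted GaussianMixture. Returns True where a frame
    lies in a run of >= min_run consecutive confusing frames."""
    names = list(gmms)
    ll = np.stack([gmms[d].score_samples(X) for d in names], axis=1)  # (T, D)
    confusing = np.array(names)[ll.argmax(axis=1)] != own_dialect     # cf. (5)
    mask = np.zeros(len(X), dtype=bool)
    start = None
    for t, c in enumerate(np.append(confusing, False)):  # sentinel closes last run
        if c and start is None:
            start = t
        elif not c and start is not None:
            if t - start >= min_run:
                mask[start:t] = True
            start = None
    return mask

def fs_score(disc_gmm, garbage_gmm, X, prior=0.6):
    """Frame-level combination of the discriminative and garbage models, with a
    prior on the discriminative model; summed over frames for the utterance."""
    ll = np.logaddexp(np.log(prior) + disc_gmm.score_samples(X),
                      np.log(1.0 - prior) + garbage_gmm.score_samples(X))
    return ll.sum()
```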
C. Minimum Classification Error (MCE) on GMM Training
A common way to deal with the confused region in the model
is to reduce it by separating the models. MCE is suitable in
this case [4], [17], [18]. MCE training separates the models
as far away as possible so the confused region is reduced [see
Fig. 4(d)].
Let us define the discriminant function as the log-likelihood function of the GMM as follows:

$$g_i(X) = \log p(X \mid \lambda_i) = \sum_{t=1}^{T} \log \sum_{k=1}^{M} p_t(i,k). \tag{6}$$

Here, $X = \{x_1, \ldots, x_T\}$ is the observation sequence, $\lambda_i$ is the GMM of the $i$th dialect class, the number of dialect classes is $D$, $M$ is the number of mixtures in the GMM, $x_t$ is the observation at time $t$, $T$ is the total number of observations, and $p_t(i,k)$ is the weighted density of Gaussian component $\lambda_{ik}$ generating the speech frame, as defined in (2). The misclassification measure is defined as

$$d_i(X) = -g_i(X) + \max_{j \ne i} g_j(X) \tag{7}$$

where the anti-discriminant function $\max_{j \ne i} g_j(X)$ is the log likelihood of the most competing class. The misclassification measure is transformed into a smooth loss function by using a sigmoid function, and the (smoothed) zero-one loss function $\ell_i(X)$ is defined as follows:

$$\ell_i(X) = \frac{1}{1 + \exp\left(-\gamma\, d_i(X)\right)} \tag{8}$$

where $\gamma$ is a slope coefficient of the sigmoid function, which controls the steepness of the sigmoid function and is a positive number. The larger the value of $\gamma$, the steeper the sigmoid function. When $d_i(X)$ is a large negative number, which indicates correct recognition, the loss function $\ell_i(X)$ has a value close to zero, which implies that no loss was incurred. On the other hand, when $d_i(X)$ is a positive number, the loss function $\ell_i(X)$ has a value between 0.5 and 1, which indicates the occurrence of an error. The overall empirical loss associated with the given observation sequence is given by

$$\ell(X; \Lambda) = \sum_{i=1}^{D} \ell_i(X)\, \mathbb{1}(X \in C_i) \tag{9}$$

where $\mathbb{1}(X \in C_i)$ is a Boolean function which returns 1 if $X$ is from class $i$ and 0 otherwise. Thus, the objective of the first stage is to minimize the expected loss function defined as

$$L(\Lambda) = E\left[\ell(X; \Lambda)\right]. \tag{10}$$

The parameter set $\Lambda$ can be estimated by first choosing an initial estimate (an MLE-trained GMM is used in our study) and iteratively adapting the set of parameters (including the means, covariances, and mixture weights of the GMMs) using gradient probabilistic descent (GPD) [3], [18] according to

$$\Lambda_{n+1} = \Lambda_n - \epsilon_n\, \nabla \ell(X; \Lambda)\big|_{\Lambda = \Lambda_n} \tag{11}$$

where $\epsilon_n$ is the learning rate, $\nabla \ell(X; \Lambda)$ is the gradient of the loss function, and $n$ is the iteration count. This represents the MCE-GMM training scheme for dialect classification in Fig. 4(d).
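The following simplified Python sketch performs a single GPD step of (6)-(11) for one training utterance, updating only the mean vectors of the correct class and its most competing class; the paper also updates covariances and mixture weights (full update rules are derived in [17], [18]). Everything here is a didactic reconstruction under those assumptions, not the authors' code.

```python
# One simplified MCE/GPD step (means only) for a single utterance X.
import numpy as np

def gmm_loglik_and_resp(w, mu, cov, X):
    """Per-utterance log-likelihood g_i(X) and per-frame posteriors r_t(k)
    for a diagonal-covariance GMM with weights w, means mu, covariances cov."""
    lw = np.stack([np.log(w[k])
                   - 0.5 * (np.sum((X - mu[k]) ** 2 / cov[k], axis=1)
                            + np.sum(np.log(2 * np.pi * cov[k])))
                   for k in range(len(w))], axis=1)        # (T, M)
    tot = np.logaddexp.reduce(lw, axis=1)                  # log sum_k, per frame
    resp = np.exp(lw - tot[:, None])                       # responsibilities
    return tot.sum(), resp                                 # g_i(X), r_t(k)

def mce_step_means(params, X, true_d, gamma=100.0, lr=0.5):
    """params: dict dialect -> (w, mu, cov) numpy arrays; mu updated in place."""
    g, resp = {}, {}
    for d, (w, mu, cov) in params.items():
        g[d], resp[d] = gmm_loglik_and_resp(w, mu, cov, X)
    comp = max((d for d in params if d != true_d), key=lambda d: g[d])
    d_mis = -g[true_d] + g[comp]                           # eq. (7)
    z = np.clip(-gamma * d_mis, -60.0, 60.0)               # avoid overflow
    loss = 1.0 / (1.0 + np.exp(z))                         # eq. (8)
    coef = gamma * loss * (1.0 - loss)                     # d loss / d d_mis
    for d, sign in ((true_d, -1.0), (comp, +1.0)):         # d d_mis / d g_d
        w, mu, cov = params[d]
        # d g_d / d mu_k = sum_t r_t(k) (x_t - mu_k) / sigma_k^2
        grad_mu = np.stack([(resp[d][:, k, None] * (X - mu[k]) / cov[k]).sum(0)
                            for k in range(len(w))])
        mu -= lr * coef * sign * grad_mu                   # eq. (11), means only
    return loss
```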
D. MS/FS After MCE on GMM Training
MCE is a technique to separate the acoustic spaces represented by the GMMs as far as possible in the minimum
classification error criterion. It will change all the parameter
values in the GMMs (mean, covariance, and mixture weight),
but it does not necessarily mean that a partial list of Gaussian
mixtures in the MCE-trained GMM is sufficient for classification. Actually, the whole list of Gaussian mixtures in the
MCE-trained GMM should be present for classification, since
this is the domain in which the MCE criterion is applied. MCE training can
reduce the confused region as shown in Fig. 4(d) by pulling the
models away from each other. However, the confused regions
are still present. Therefore, MS or FS training can be applied
after the MCE training. Since MS or FS training removes or
balances the confused regions, applying MCE after MS or FS
is not favorable. If MS training or FS training is applied after
MCE training, the excluded region or shared region in the
new GMMs after MCE training will be smaller than directly
applying MS training or FS training on the original models. In
Fig. 4(e) and (f), the excluded/shared region is smaller than the
excluded/shared region in Fig. 4(b) and (c). Since the model
structures from Fig. 4(e) and (f) are changed less, we expect
that applying MS or FS after MCE training (i.e., MCE-MS or
MCE-FS) will have less impact on the models than directly
applying MS or FS, so that MCE-MS or MCE-FS training can
achieve better performance than MS or FS training alone. Certainly, applying MS or FS after MCE will usually yield a smaller additional improvement than applying MS or FS directly to the ML-trained models, since the MCE training has already increased the model discriminability to some extent. These
arguments will be investigated in Section IV.
IV. EXPERIMENTS
In this section, the proposed algorithms are evaluated using
dialect data from English and Spanish. There are no associated transcripts for the training or test data in any language or dialect. The English and Spanish data are spontaneous speech (i.e., one person from the dialect region talking
spontaneously).
The features used in our study are the well-known Mel-frequency cepstral coefficients (MFCCs), represented as $c_1$–$c_{12}$. The log energy and $c_1$–$c_{12}$, plus their delta coefficients, are used. In total, 26 coefficients are used to represent each frame. The
window size for a frame is 20 ms, with a skip rate of 10 ms.
Therefore, the frame rate is 100 frames per second.
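For reference, a hedged sketch of such a front end using librosa follows; the paper's exact energy computation and filterbank settings are not specified here, so the 0th cepstral coefficient stands in for the log-energy term as an assumption.

```python
# Illustrative 26-dim front end: 20-ms windows, 10-ms skip (100 frames/s),
# 13 static coefficients (c0 as an energy proxy) plus their deltas.
import librosa
import numpy as np

def extract_features(wav_path):
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.020 * sr),       # 20-ms window
                                hop_length=int(0.010 * sr))  # 10-ms skip
    delta = librosa.feature.delta(mfcc)
    return np.vstack([mfcc, delta]).T   # (n_frames, 26)
```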
The static training scheme is applied for MLE, MS, FS, and
MCE GMM training (i.e., all the training data is presented to
TABLE I
THREE-DIALECT SPANISH CORPUS USED IN OUR STUDY
the system at training). In the evaluation, the large test audio files
are partitioned into small utterances. The test utterance length is
10 s in duration unless otherwise specified. The final classification performance is the average over all test utterances.
A. Dialect Corpora: Spanish and English
The first corpus used in our study consists of Latin American Spanish dialect speech from Cuba, Peru and Puerto Rico,
which is partially described in [31]. The spontaneous speech
portion was recorded in an interview style. The interviewer gave
sample topics such as “describe your family”, and the subject
would respond. The interviewer would also give hints to keep
the subject talking. The subject used a head-mounted microphone, which also captured the speech from the interviewer at
a much lower amplitude since the interviewer sat far away from
the microphone. The speech from both the interviewer and the
subject were recorded on the same channel. We note that there
were also long periods of silence in the audio. To address the
above issues, we employ a silence remover to eliminate long
silence and potential speech from the interviewer. The silence
remover is based on an overall energy measure. Table I summarizes training and test data after the silence removal process. PR
represents the dialect from Puerto Rico. The speakers used for
training and testing are roughly balanced between male and female speakers. Since the size of the corpus is limited, we do not
set aside data for development use, and therefore all of the data
is used either in training or test. We will illustrate several combinations of parameters in the experimental results. In an actual
application, a development data set can be applied to select the
best parameters.
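A minimal sketch of an energy-based silence remover of the kind described above follows; the frame-energy computation and the 30-dB margin are illustrative assumptions, not values from the paper.

```python
# Illustrative energy-based silence remover; thresholds are assumptions.
import numpy as np

def frame_log_energy(y, frame_len, hop):
    """Per-frame log energy (dB) of waveform y."""
    n = max(0, 1 + (len(y) - frame_len) // hop)
    e = np.array([np.sum(y[i * hop : i * hop + frame_len] ** 2)
                  for i in range(n)])
    return 10.0 * np.log10(e + 1e-12)

def keep_speech_frames(y, sr, margin_db=30.0):
    """Keep frames within margin_db of the utterance maximum; drops long
    silences and, ideally, the low-amplitude interviewer speech."""
    e_db = frame_log_energy(y, int(0.020 * sr), int(0.010 * sr))
    return np.where(e_db > e_db.max() - margin_db)[0]
```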
The second corpus used is IViE [28], which consists of eight
British English dialects. We select three dialects for use in our
study based on geographical origins, since we have three dialects for Spanish. These dialects are from Belfast (Northern
Ireland), Cambridge (England), and Cardiff (Wales). In each dialect, there are four male and four female speakers for training;
and two male and two female speakers for testing. The training
data is 57 min in total duration; the test data is 30 min in
total duration. Therefore, the IViE corpus is a relatively small
size corpus.
B. Probe Experiments
For the proposed algorithms, there are several parameter settings required before formal evaluation. These parameters include the discriminative measure in Section III-A, the range
for mixture selection, the sigmoid steepness coefficient $\gamma$, the learning rate in MCE training, and the number of mixtures for GMM
training. Since the IViE corpus is too small to be representative,
Fig. 7. Gaussian mixture sorting using the four discriminative measures.
(a) Based on raw frame count (RFC). (b) Based on normalized frame count
(NFC). (c) Based on raw probability score (RPS). (d) Based on normalized
probability score (NPS).
Fig. 8. Classification accuracy of MS-GMM by selecting different percentages of discriminative mixtures. X-axis: percentage of discriminative mixtures selected (from 10% to 100%; selecting 100% of the mixtures yields the baseline model). Y-axis: classification accuracy (%).
Spanish data is used here for probe experiments, with settings
held constant for both English and Spanish corpora.
The first experiment is to determine which discriminative
measure in Section III-A can sort the mixtures consistently
for both training and unseen test data. Therefore, the most
confusing mixtures can be excluded and a new set of GMMs
generated. Fig. 7 shows mixture sorting using the four discriminative measures described in Section III-A (RFC, NFC, RPS,
and NPS). The points in the figures are the numerical labels
of mixtures. Different shapes of points denote different dialect classes. The X and Y axes represent the number of mixtures.
We use 200 mixtures for the baseline GMM training in Figs. 7,
8 and Table II. Fig. 7 is generated by first using the training
data as the input to sort the mixtures of the baseline GMMs and
TABLE II
CLASSIFICATION ACCURACY OF THE FOUR MS-GMM FORMULATED SCHEMES;
HERE THE TOP 75% OF THE MIXTURES ARE SELECTED
obtain new GMMs, and then using the test data as the input to sort the mixtures of the new GMMs; the resulting sequences of mixtures are drawn in the figure. If the top discriminative mixtures in the training data are also the top discriminative mixtures in the test data, the points in the figure will cluster near the diagonal line $y = x$. If the top discriminative mixtures in the training data are the least discriminative mixtures in the test data, the points will cluster along the opposite diagonal. The ideal case is that all points fall on the line $y = x$. From Fig. 7, we observe that mixture
sorting based on the normalized probability score can identify
the top discriminative mixtures in a consistent manner for
both training and test data. It also shows that the development
data is not necessary for mixture sorting, since the Gaussian
mixtures can be sorted using the training data for the original
GMMs. Table II shows the classification accuracy of the mixture selected GMM (MS-GMM) with the four discriminative
measures. We pick the top 75% of the mixtures for the new
GMMs. We observe that all four mixture selection schemes
can improve dialect classification accuracy. The normalized
probability score is the best scheme for sorting the mixtures.
We will use this scheme for the following experiments.
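The consistency check behind Fig. 7 can also be summarized numerically; the sketch below compares train- and test-derived mixture orderings, with Spearman rank correlation added as a convenience statistic that the paper itself does not report.

```python
# Rank-consistency check for a discriminating-ability measure (illustrative).
import numpy as np
from scipy.stats import spearmanr

def rank_consistency(da_train, da_test):
    """da_*: per-mixture discriminating-ability scores from train/test data.
    Returns the two rank vectors (0 = most discriminative) and their Spearman
    correlation; ranks hugging y = x indicate a consistent measure."""
    r_train = np.argsort(np.argsort(-np.asarray(da_train)))
    r_test = np.argsort(np.argsort(-np.asarray(da_test)))
    rho, _ = spearmanr(r_train, r_test)
    return r_train, r_test, rho
```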
Next, we show how the percentage of selected mixtures affects classification performance. Fig. 8 shows classification accuracy of MS-GMM formulated by selecting different percentages of the top mixtures. When the top 100% of the mixtures are
selected, the system reduces to the baseline GMM. From Fig. 8,
we observe that: 1) we must retain at least 50% of the original
mixtures and 2) the classification performance improves if the
least discriminative mixtures (i.e., the most confusing mixtures)
are removed.
We also train baseline GMMs with 300, 600, and 1000 mixtures, and the top 50% to 100% of the mixtures are selected based on the normalized probability score measure. The
results are shown in Fig. 9. When 100% of the Gaussian mixtures are selected, the system reduces to the baseline GMM. The
classification accuracy of the baseline 300, 600, and 1000 mixture GMM is 73.7%, 74.3%, and 73.7%, respectively, as shown
in Fig. 9 (i.e., the case where the percentage of mixtures selected is 100%). We observe several interesting results: 1) when
selecting 50% to 95% of the sorted mixtures, the new GMMs
outperform the baseline GMMs and 2) if more mixtures are in
the baseline GMMs, a smaller portion of the sorted mixtures
need be retained to obtain the best GMMs. In Fig. 9, we observe
that when 65% of the mixtures are selected for the 1000-mixture
baseline GMMs, the new GMMs achieve the best performance;
for the 600-mixture GMMs and 300-mixture GMMs, the new
GMMs achieve the best performance when 75% and 80% of
the mixtures are selected, respectively.
There are several parameter settings required for MCE
training. Based on the previous experiments [4], [17], [18] and
our experience, the learning rate for MCE training should be less
than one and scaled down gradually. We set the initial learning
Fig. 9. Classification accuracy of the MS-GMM by selecting different percentages of discriminative mixtures. The initial GMM has 300, 600, and 1000 mixtures, respectively. X-axis: percentage of discriminative mixtures selected (from 50% to 100%; selecting 100% of the mixtures yields the baseline model). Y-axis: classification accuracy (%).
TABLE III
SPANISH DIALECT CLASSIFICATION PERFORMANCE OF MCE TRAINED GMM
CLASSIFIER USING DIFFERENT $\gamma$ VALUES IN THE SIGMOID FUNCTION
rate to 0.5, and the learning rate is scaled down by 0.8 after
every third iteration. We determined that overall performance is sensitive to the sigmoid steepness coefficient $\gamma$ in (8). Table III shows dialect classification performance when different values of $\gamma$ are used in MCE training. The 200-mixture GMM is used in this experiment. The baseline classification accuracy is 73.5%, as shown in Table II. The initial learning rate is 0.5 and the scale-down factor is 0.8. From Table III, we find that $\gamma = 100$ is a good value for MCE training. It is also observed that updating the mixture variances in the GMMs is useful for dialect classification. Therefore, the final parameters for MCE training in our study are as follows: $\gamma$ is 100, the initial learning rate is 0.5, the scale-down factor is 0.8, the number of iterations is set to 10, and variance updating is employed.
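These settings amount to a simple step schedule; the helper below reproduces one possible reading of "scaled down by 0.8 after every third iteration."

```python
# Learning-rate schedule implied by the MCE settings above (one reading).
def mce_learning_rates(lr0=0.5, scale=0.8, n_iter=10, every=3):
    rates, lr = [], lr0
    for it in range(1, n_iter + 1):
        rates.append(lr)
        if it % every == 0:
            lr *= scale
    return rates   # [0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.32, ...]
```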
The last probe experiment is to test how well the FS
technique will help improve the GMM-based classifier. The
FS-based GMM (FS-GMM) training method will produce a
discriminative model and a garbage model. The final model is
obtained by combining the discriminative model and garbage
model with predefined prior probabilities. Here, we still use
the 300, 600, and 1000 mixtures for the baseline GMMs in this
experiment. As shown in Fig. 9, the 300, 600, and 1000-mixture MLE-GMM-based classifiers achieve 73.7%, 74.3%, and
73.7% accuracy, respectively. As for the newly trained GMMs,
the discriminative model has the same number of mixtures as
the baseline GMMs, and the garbage model has one-third the
number of mixtures as the baseline GMMs. Fig. 10 shows the
Spanish dialect classification accuracy of the combined models
generated with different prior probabilities. From Fig. 10, we
Fig. 10. Spanish dialect classification accuracy of the FS-GMM by setting different prior probabilities for the discriminative model and garbage model. The initial GMM has 300, 600, and 1000 mixtures, respectively. X-axis: prior probability $x$ for the discriminative model in the combination (from $x = 0.05$ to 1); the prior probability for the garbage model is $1 - x$. Y-axis: classification accuracy (%).
observe several interesting results: 1) the discriminative model
alone only achieves marginal improvement (i.e., the prior
probability is 1 for the discriminative model in Fig. 10) and 2)
the “garbage” model does help dialect classification.
Fig. 11. Classification accuracy of the proposed algorithms on English [(a)
and (b)] and Spanish [(c) and (d)] dialect corpora. Evaluated algorithms include MLE-GMM, MS-GMM, FS-GMM, MCE-GMM, MCE-MS-GMM, and
MCE-FS-GMM.
TABLE IV
CLASSIFICATION ACCURACY (%) OF PROPOSED ALGORITHMS
ON THE TWO CORPORA (SUMMARIZED VERSION OF FIG. 11)
C. Evaluation on English and Spanish Dialect Corpora
Having established the parameter settings for MS-GMM,
FS-GMM, and MCE-GMM, we now turn to formal evaluation on the two dialect corpora: English and Spanish. From
Section IV-B, it is observed that the mixture number for the
GMM does not significantly impact performance if the value
is in a reasonable range (say, from 300 to 1000). We use 600
mixtures in the GMMs for our evaluation here. The settings for
MCE are the same as in Section IV-B.
Fig. 11 shows classification accuracy of the proposed algorithms on the two dialect corpora. Fig. 11(a) and (b) uses
English dialects, while Fig. 11(c) and (d) uses Spanish dialects.
All six algorithms shown in Fig. 4, Sections II, and III are
evaluated (MLE-GMM, MS-GMM, FS-GMM, MCE-GMM,
MCE-MS-GMM, and MCE-FS-GMM). In Fig. 11, MS-GMM
means classifying using the Mixture Selection (MS) trained
GMMs (see Fig. 4(b) and Section III-A); FS-GMM means
classifying using the Frame Selection (FS) trained GMMs
(see Fig. 4(c) and Section III-B); MCE-MS-GMM means
MS applied on the MCE trained GMMs (see Fig. 4(e) and
Section III-D); MCE-FS-GMM means FS applied on the MCE
trained GMMs (see Fig. 4(f) and Section III-D). The X-axis of the left sub-figures is the percentage of top discriminative mixtures selected (from 50% to 100%). When 100% of the mixtures are selected, MS-GMM actually degenerates to the baseline GMM (i.e., MLE-GMM; see Fig. 4(a) and Section II), and MCE-MS-GMM degenerates to MCE-GMM (see Fig. 4(d) and Section III-C). The X-axis of the right sub-figures is the prior probability $x$ for the discriminative model in the combination (from $x = 0.05$ to 1); the prior probability for the garbage model is $1 - x$. The Y-axis of all the sub-figures is classification accuracy (%). From Fig. 11,
we find that MS-GMM and FS-GMM achieve significant improvement in dialect classification. MCE training also improves
classification performance. When MS or FS techniques are
applied after MCE training, further improvement is achieved.
Table IV summarizes the best performance from Fig. 11 for
MS-GMM, FS-GMM, MCE-MS-GMM, and MCE-FS-GMM.
From Table IV, the best systems achieve 43.7% and 16.4%
relative error reduction versus the baseline MLE-GMM system
on English and Spanish dialect corpora, respectively.
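As a point of reference for these numbers, relative error reduction is computed in the usual way from the error rates (one minus accuracy): $\mathrm{RER} = (E_{\mathrm{MLE}} - E_{\mathrm{new}})/E_{\mathrm{MLE}}$, where $E_{\mathrm{MLE}}$ and $E_{\mathrm{new}}$ are the error rates of the baseline and proposed systems; this is our reading of how the reported figures are derived.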
D. Comparison With the Human Listener Test
It is interesting to assess how well machine performance
compares with human performance. From our prior research,
Ikeno and Hansen [15] conducted a listener test using the same
three British dialects data from the IViE corpus. They recruited
11 listeners from the United Kingdom (U.K.), 11 listeners from
the United States (U.S.), and 11 listeners from non-English
speaking countries. All listeners were either college students
or faculty, and therefore highly educated with potentially some
exposure to British dialects, especially for those listeners originally from the U.K. An approximately 60-s-long audio file per
TABLE V
FS-GMM SYSTEM VERSUS LISTENERS FROM U.K.
ON THE THREE BRITISH DIALECTS
dialect was provided for listener training and reference. These
audio files were always available for reference during testing.
Each test utterance had nine words on average and was 3 s in duration. From [15], the listeners from the U.K. performed much better
than listeners from the U.S. or non-English speaking countries.
The classification accuracy of British listeners, American
listeners, and non-native listeners is 82.3%, 55.7%, and 45.0%
respectively. We evaluate the proposed automatic dialect
classification system using the same 3-s-long test utterances
for comparison. Table V shows the comparison of listeners
from U.K., the MLE-GMM (i.e., baseline GMM system), and
proposed FS-GMM system. The number of mixtures for the
GMM is held the same at 600. The performance of FS-GMM
listed in Table V can be achieved by setting the prior probability of the discriminative GMM anywhere from 0.5 to 0.7 in FS-GMM.
From Table V, we observe that the MLE-trained GMM-based
classifier achieves performance comparable to the human listeners, with
British listeners having more trouble distinguishing Cardiff
dialect, and MLE-GMM having more difficulty with Belfast
dialect. The proposed FS-GMM system clearly outperforms the
human listeners (7.4% absolute improvement) and MLE-GMM
system (6.8% absolute improvement), with consistent improvement for Belfast versus MLE-GMM, and for Cardiff versus
British listeners. This suggests that the difficulties in dialect
perception of Cardiff for British listeners and modeling limitations for Belfast dialect for MLE-GMM are simultaneously
addressed with FS-GMM.
V. CONCLUSION
Speech produced under different dialects of a language is
represented using a range of acoustic events. These acoustic
events can be separated into dialect discriminating content,
and dialect neutral/distractive content. In the Gaussian mixture
model space, the dialect distractive content is the confused
region of the models, which can be excluded, included, or
reduced. Three corresponding techniques were developed in
this study named: Mixture Selection (MS), Frame Selection
(FS), and MCE (i.e., MS-GMM, FS-GMM, and MCE-GMM).
The MLE-trained GMM-based system representing the baseline system is easy to implement and can achieve reasonable
performance (see Section IV-D). The GMM trained with MS
or FS achieves measurable performance improvement over the
MLE trained GMM system. The MCE-trained system achieves
similar performance compared with MS or FS trained systems.
However, MS or FS training has a lower computational load
than MCE training. Therefore, MS or FS training is a better
alternative than MCE training for GMM-based dialect classification if fast computation is needed. During the test, the
computational load for MS, FS, and MCE is almost the same.
Interestingly, MS or FS training and MCE training can be
performed sequentially, resulting in further improvement. Evaluated on English dialects and Spanish dialects, the proposed
MCE-MS-GMM or MCE-FS-GMM methods achieve 43.7%
and 16.4% relative error reduction versus the MLE-GMM
system on English and Spanish corpora respectively. The
FS-GMM system achieves 41.8% relative error reduction
compared to the (knowledgeable) human listeners in British
dialect classification. Therefore, the proposed automatic dialect
classification systems will be beneficial in real-world speech
applications.
REFERENCES
[1] L. Burget, P. Matejka, and J. Cernocky, “Discriminative training techniques for acoustic language identification,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Toulouse, France, 2006, pp. 209–212.
[2] D. L. Canfield, Spanish Pronunciation in the Americas. Chicago, IL:
Univ. Chicago Press, 1981.
[3] E. W. Cheney, Introduction to Approximation Theory. New York:
AMS/Chelsea, 1982.
[4] W. Chou, “Discriminant-function-based minimum recognition error
rate pattern-recognition approach to speech recognition,” Proc. IEEE,
vol. 88, no. 8, pp. 1201–1223, Aug. 2000.
[5] D. Crystal, A Dictionary of Linguistics and Phonetics. Malden, MA:
Blackwell, 1997.
[6] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Statist. Soc., vol. 39, pp.
1–38, 1977.
[7] V. Diakoloukas, V. Digalakis, L. Neumeyer, and J. Kaja, “Development of dialect-specific speech recognizers using adaptation methods,”
in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Munich, Germany, Apr. 1997, vol. 2, pp. 1455–1458.
[8] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley, 2001.
[9] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression, 6th ed. Norwell, MA: Kluwer, 1991.
[10] V. Gupta and P. Mermelstein, “Effect of speaker accent on the performance of a speaker-independent isolated word recognition,” J. Acoust.
Soc. Amer., vol. 71, pp. 1581–1587, 1982.
[11] C. Huang, T. Chen, S. Li, E. Chang, and J.-L. Zhou, “Analysis of
speaker variability,” in Proc. Eur. Conf. Speech Commun. Technol.,
Aalborg, Denmark, Sep. 2001, vol. 2, pp. 1377–1380.
[12] R. Huang and J. H. L. Hansen, “Advances in word based dialect/accent classification,” in Proc. Interspeech-Eurospeech, Lisbon, Portugal,
Sep. 2005, pp. 2241–2244.
[13] R. Huang and J. H. L. Hansen, “Dialect/accent classification via
boosted word modeling,” in Proc. IEEE Int. Conf. Acoustics, Speech,
Signal Process., Philadelphia, PA, Mar. 2005, vol. 1, pp. 585–588.
[14] J. J. Humphries and P. C. Woodland, “The use of accent-specific pronunciation dictionaries in acoustic model training,” in Proc. IEEE Int.
Conf. Acoust., Speech, Signal Process., Seattle, WA, May 1998, vol. 1,
pp. 317–320.
[15] A. Ikeno and J. H. L. Hansen, “Perceptual recognition cues in native English accent variation: Listener accent, perceived accent, and comprehension,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Toulouse, France, May 2006, pp. I-401–I-404.
[16] A. S. Jayram, V. Ramasubramanian, and T. Sreenivas, “Language identification using parallel sub-word recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Hong Kong, China, Apr. 2003, vol. 1, pp. 32–35.
[17] B.-H. Juang, W. Chou, and C.-H. Lee, “Minimum classification error
rate methods for speech recognition,” IEEE Trans. Speech Audio
Process., vol. 5, no. 3, pp. 257–265, May 1997.
[18] B.-H. Juang and S. Katagiri, “Discriminative learning for minimum
error classification,” IEEE Trans. Signal Process., vol. 40, no. 12, pp.
3043–3054, Dec. 1992.
[19] J. M. Lipski, Latin American Spanish. London, U.K.: Longman,
1994.
[20] M.-K. Liu, B. Xu, T.-Y. Huang, Y.-G. Deng, and C.-R. Li, “Mandarin
accent adaptation based on context-independent/ context-dependent
pronunciation modeling,” in Proc. IEEE Int. Conf. Acoust., Speech,
Signal Process., Istanbul, Turkey, Jun. 2000, vol. 2, pp. 1025–1028.
HUANG AND HANSEN: UNSUPERVISED DISCRIMINATIVE TRAINING WITH APPLICATION TO DIALECT CLASSIFICATION
[21] P. Matejka, L. Burget, P. Schwarz, and J. Cernocky, “NIST language recognition evaluation 2005,” in Proc. NIST LRE 2005, Washington, DC, 2006, pp. 1–37.
[22] A. Nadas, “A decision-theoretic formulation of a training problem
in speech recognition and a comparison of training by unconditional
versus conditional maximum likelihood,” IEEE Trans. Acoust.,
Speech, Signal Process., vol. ASSP-31, no. 4, pp. 814–817, Aug. 1983.
[23] D. Povey, “Discriminative training for large vocabulary speech recognition,” Ph.D. dissertation, Cambridge Univ., Cambridge, U.K., 2004.
[24] D. Reynolds and R. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. Speech
Audio Process., vol. 3, no. 1, pp. 72–83, Jan. 1995.
[25] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Dig. Signal Process., vol. 10, no. 1–3, pp. 19–41, Jan./Apr./Jul. 2000.
[26] P. A. Torres-Carrasquillo, T. P. Gleason, and D. A. Reynolds, “Dialect
identification using Gaussian mixture models,” in Proc. Odyssey: The
Speaker and Language Recognition Workshop, 2004, pp. 297–300.
[27] W. Ward, H. Krech, X. Yu, K. Herold, G. Figgs, A. Ikeno, D. Jurafsky, and W. Byrne, “Lexicon adaptation for LVCSR: Speaker idiosyncrasies, non-native speakers, and pronunciation choice,” in Proc. ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation, Estes Park, CO, Sep. 2002.
[28] Web: IViE-British Dialect Corpus [Online]. Available: http://www.
phon.ox.ac.uk/esther/ivyweb/
[29] J. C. Wells, Accents of English. Cambridge, U.K.: Cambridge Univ.
Press, 1982, vol. I, II, III.
[30] L. R. Yanguas, G. C. O’Leary, and M. A. Zissman, “Incorporating linguistic knowledge into automatic dialect identification of Spanish,” in
Proc. Int. Conf. Spoken Lang. Process., Sydney, Australia, Nov. 1998.
[31] M. A. Zissman, T. P. Gleason, D. M. Rekart, and B. L. Losiewicz, “Automatic dialect identification of extemporaneous conversational, Latin
American Spanish speech,” in Proc. IEEE Int. Conf. Acoust., Speech,
Signal Process., 1996, vol. 2, pp. 777–780.
[32] P. Angkititrakul and J. H. L. Hansen, “Advances in phone-based modeling for automatic accent classification,” IEEE Trans. Audio, Speech,
Lang. Process., vol. 14, no. 2, pp. 634–646, Mar. 2006.
Rongqing Huang (S’01–M’06) was born in China
in 1979. He received the B.S. degree from the
University of Science and Technology of China
(USTC), Hefei, China, in 2002, and the M.S. and
Ph.D. degrees from the University of Colorado
at Boulder in 2004 and 2006, respectively, all in
electrical engineering.
In 2006, he joined the Acoustic Modeling Group,
Dragon and Healthcare R&D, Nuance Communications, Burlington, MA, as a Research Scientist. He
works on large vocabulary acoustic modeling for network, medical, and other Nuance speech recognition products. From 2000 to
2002, he worked in the USTC iFlyTek Speech Lab. From 2002 to 2005, he
worked in the Robust Speech Processing Group, Center for Spoken Language
Research, University of Colorado at Boulder. He was a Ph.D. Research Assistant in the Department of Electrical and Computer Engineering. In 2005, he was
a summer intern with Motorola Labs, Schaumburg, IL. He was a research intern
in the Center for Robust Speech Systems at the University of Texas at Dallas
from 2005 to 2006. His research interests include the general areas of large vocabulary speech recognition, machine learning, and robust signal processing.
John H. L. Hansen (S’81–M’82–SM’93–F’07)
received the B.S.E.E. degree from the College of
Engineering, Rutgers University, New Brunswick,
NJ, in 1982 and the M.S. and Ph.D. degrees in
electrical engineering from Georgia Institute of
Technology, Atlanta, in 1983 and 1988, respectively.
He joined the Erik Jonsson School of Engineering
and Computer Science, University of Texas at Dallas
(UTD), Richardson, in the fall of 2005, where he is
Professor and Department Chairman of Electrical
Engineering, and holds the Distinguished University
Chair in Telecommunications Engineering. He also holds a joint appointment as
Professor in the School of Brain and Behavioral Sciences (Speech and Hearing).
At UTD, he established the Center for Robust Speech Systems (CRSS), which
is part of the Human Language Technology Research Institute. In 1988, he
established the Robust Speech Processing Laboratory (RSPL) and continues
to direct research activities in the CRSS at UTD. Previously, he served as
Department Chairman and Professor in the Department of Speech, Language,
and Hearing Sciences (SLHS) and as Professor in the Department of Electrical
and Computer Engineering, both at the University of Colorado at Boulder
(1998–2005), where he cofounded the Center for Spoken Language Research.
He has also served as a Technical Advisor to the U.S. Delegate for NATO
(IST/TG-01). His research interests span the areas of digital speech processing,
analysis, and modeling of speech and speaker traits, speech enhancement,
feature estimation in noise, robust speech recognition with emphasis on
spoken document retrieval, and in-vehicle interactive systems for hands-free
human–computer interaction. He has supervised 39 (18 Ph.D., 21 M.S.) thesis
candidates and is the author/coauthor of 251 journal and conference papers in
the field of speech processing and communications. He was the coauthor of
the textbook Discrete-Time Processing of Speech Signals (IEEE Press, 2000),
coeditor of DSP for In-Vehicle and Mobile Systems (Springer, 2004), Advances
for In-Vehicle and Mobile Systems: Challenges for International Standards
(Springer, 2007), and lead author of the report “The Impact of Speech Under
’Stress’ on Military Speech Technology” (NATO RTO-TR-10, 2000).
Dr. Hansen has served as the IEEE Signal Processing Society Distinguished
Lecturer for 2005/2006, and as a Member of the IEEE Signal Processing Society Speech Technical Committee and Educational Technical Committee, the
Speech Communications Technical Committee for the Acoustical Society of
America (2000–2003), and the International Speech Communications Association (ISCA) Advisory Council (2004–2010). He has served as Associate Editor
for the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING (1992–1999),
IEEE SIGNAL PROCESSING LETTERS (1998–2000), and as an Editorial Board
Member for the IEEE Signal Processing Magazine (2001–2003). He has also
served as Guest Editor of the October 1994 Special Issue on Robust Speech
Recognition for the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING.
He was the recipient of the 2005 University of Colorado Teacher Recognition
Award as voted by the student body. He also organized and served as General Chair for ICSLP-2002: International Conference on Spoken Language Processing and will serve as Technical Program Chair for IEEE ICASSP-2010, to
be held in Dallas, TX.