Unsupervised Discriminative Training With Application to Dialect Classification

Rongqing Huang, Member, IEEE, and John H. L. Hansen, Fellow, IEEE

Abstract—Automatic dialect classification has gained interest in the field of speech research because of its importance in characterizing speaker traits and estimating knowledge that could improve integrated speech technology (e.g., speech recognition, speaker recognition). This study addresses novel advances in unsupervised spontaneous dialect classification in English and Spanish. The problem considers the case where no transcripts are available for training and test data, and speakers are talking spontaneously. The Gaussian mixture model (GMM) is used for unsupervised dialect classification in our study. Techniques which aim to deal with confused acoustic regions in the GMMs are proposed, where the confused regions are identified through data-driven methods. The first technique excludes confused regions by finding dialect dependence in the untranscribed audio and selecting the most discriminative Gaussian mixtures [mixture selection (MS)]. The second technique includes the confused regions in the model, but the confused regions are balanced over all classes; it is implemented by identifying discriminative frames and confused frames in the audio data [frame selection (FS)]. The new confused regions contribute to model representation but do not impact classification performance. The third technique reduces the confused regions in the original model; minimum classification error (MCE) training is applied to achieve this objective. All three techniques implement discriminative training for GMM-based classification. Both the first technique (MS-GMM, GMM trained with mixture selection) and the second technique (FS-GMM, GMM trained with frame selection) improve dialect classification performance. Further improvement is achieved by applying the third technique (MCE training) before the first or second technique. The system is evaluated using British English dialects and Latin American Spanish dialects, and measurable improvement is achieved on both corpora. Finally, the system is compared with human listener performance and shown to outperform human listeners in terms of classification accuracy.

Index Terms—Accent and dialect, accent classification, automatic dialect classification, discriminative training, English dialects, frame selection Gaussian mixture model (FS-GMM), Gaussian mixture selection, minimum classification error (MCE), robust speech recognition, Spanish dialects.

Manuscript received November 15, 2006; revised June 6, 2007. This work was supported by the U.S. Air Force Research Laboratory, Rome, NY, under Contract FA8750-04-1-0058 and RADC under Contract A40104. Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the U.S. Air Force. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Timothy J. Hazen.

R. Huang was with the Center for Robust Speech Systems (CRSS), Department of Electrical Engineering, University of Texas at Dallas, Richardson, TX 75083-0688 USA. He is now with Nuance Communications, Burlington, MA 01803 USA.
J. H. L. Hansen is with the Center for Robust Speech Systems (CRSS), Department of Electrical Engineering, University of Texas at Dallas, Richardson, TX 75083-0688 USA (e-mail: john.hansen@utdallas.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASL.2007.903302

I. INTRODUCTION

THIS study employs the following definition for the term accent: "The cumulative auditory effect of those features of pronunciation which identify where a person is from regionally and socially. The linguistic literature emphasizes that the term refers to pronunciation only, and is thus distinct from dialect, which refers to grammar and vocabulary as well" [5]. In our study, we regard a dialect/accent as a pattern of pronunciation and/or vocabulary of a language used by a community of native/non-native speakers belonging to some geographical region. For example, American English and British English are two dialects of English; English spoken by native Chinese or German speakers represents two accents of English. Some researchers define dialect and accent slightly differently, depending on whether they approach the problem from a linguistics or a speech science/engineering perspective. In our study, we use "dialect" and "accent" interchangeably, since the algorithms formulated here can be applied to both dialect and accent detection.

Automatic dialect classification is important for characterizing speaker traits, as well as for estimating knowledge that could be used to improve speech system performance (e.g., speech recognition, speech coding, speaker recognition). Dialect/accent is one of the most important factors, next to gender, that influence automatic speech recognition (ASR) performance [10], [11]. Dialect knowledge could be used in various components of an ASR system, such as pronunciation modeling [20], lexicon adaptation [27], and acoustic model training [14] and adaptation [7]. Dialect knowledge can also be applied directly in automatic call center and directory lookup services [31]. Effective methods for accent modeling and detection have also been developed, which can likewise contribute to improved speech systems [32].

Our efforts in dialect identification focus on classifying unconstrained audio, which represents unknown gender, unknown (new) speakers, and unknown text. If transcripts exist for the associated training audio, we have previously proposed a word-based dialect classification (WDC) algorithm, which turns the text-independent dialect classification problem into a text-dependent one [13] and achieves very high classification accuracy. If the training data size is too small to train the word-specific models, a context-adaptive training (CAT) algorithm was proposed to address this problem and also achieves high classification accuracy [12]. If no transcripts are available for the training data, the above algorithms cannot be applied, and therefore an unsupervised algorithm must be formulated.¹ The Gaussian mixture model (GMM)-based classifier has been applied successfully to unsupervised dialect classification [26] and text-independent speaker recognition [25]. In our study, the GMM-based classifier is used for unsupervised dialect classification, with the focus on how to formulate effective discriminative GMM training to achieve improved performance.
In our study, three dialects from each of two languages are considered: British English dialects from Cambridge, Belfast, and Cardiff; and Latin American Spanish dialects from Cuba, Peru, and Puerto Rico. The patterns that distinguish dialects may differ from one language to another. In many languages, however, dialect differences arise when particular phonemes occur in certain positions within words or phrases. In English dialects, a rhotic after the vowel (e.g., farm: /F AA R M/ vs. /F AA M/) and syllable boundary placement (e.g., self#ish vs. sel#fish) are dialect differences visible at the word level [29]. Spanish dialect differences also concentrate on particular phonemes in certain positions [2], [19]; for example, syllable-final /s/ is dropped in Cuban Spanish and reinforced in Peruvian Spanish. A GMM classifier trained and tested on human labels that include only this information can achieve 98% accuracy on 20-s audio files [30]. While dialect-dependent information is represented by certain acoustic events, it follows that some acoustic events show no difference among dialects, which might distract or even confuse dialect classification. Since no transcripts are available for the audio training data, an automatic way to identify the dialect-dependent and dialect-confusing acoustic events is valuable. If the acoustic events are represented by Gaussian mixtures in the model space, the identification task can be implemented in the model space as well. In this study, we propose a data-driven method to identify the confused regions in the model space. Next, three distinct techniques are proposed to deal with these confused regions. The first technique excludes the confused regions from the model. The second technique includes the confused region, but normalized so that it maps the confused acoustic events without impacting the classification decision. The third technique is based on the well-known minimum classification error (MCE) criterion, which essentially reduces the confused regions in the model. An interesting observation is that the first or second technique can be applied after the third.

The remainder of this paper is organized as follows. Section II introduces the GMM-based classification system. Section III is dedicated to the three discriminative techniques for GMM training: the first technique, mixture selection-based GMM training (MS-GMM), is proposed in Section III-A; the second technique, frame selection-based GMM training (FS-GMM), is proposed in Section III-B; and the third technique, MCE-based GMM training, is reviewed in Section III-C. The idea of combining the first or second technique with the third is presented in Section III-D. Experimental evaluation on the two corpora is presented in Section IV, which includes a comparison between human and machine dialect classification in Section IV-D. Finally, a summary and conclusions are presented in Section V.

¹The dialect label for the training data is always known. Note that the term "unsupervised training" in this study means transcription-free training, which is different from the traditional definition of "unsupervised training" in the pattern recognition community.

Fig. 1. Naive Bayes model. $c$ is the class label; $x_i$ is the $i$th observation vector.

Fig. 2. Baseline GMM-based unsupervised training system.
II. GMM-BASED CLASSIFICATION ALGORITHM

Since no transcripts are available for training and test data, it is difficult to build a supervised generative model such as an HMM. The simplest method for unsupervised speech-related classification is to apply a naive Bayes classifier [8] (see Fig. 1). In naive Bayes, each observation vector is generated independently, and the sequence information of the observation vectors is ignored. Typical naive Bayes models in speech processing are vector quantization (VQ) [9] and the Gaussian mixture model (GMM) [24]; VQ can be considered hard-decision modeling, while the GMM can be considered soft-decision modeling with an underlying probability distribution. The GMM classifier is a popular method for text-independent speaker recognition [25] and dialect classification [26]. We use the GMM classifier as our baseline system.

Fig. 2 shows the block diagram of the baseline GMM training system, where $D$ is the number of predefined dialects. The GMM $\lambda_i$ for dialect $i$ is trained with data from dialect $i$. The training method is a generalized maximum-likelihood estimation (MLE) procedure, the expectation-maximization (EM) algorithm [6], [24]. In our study, the GMM trained with MLE (i.e., MLE-GMM) is applied as the baseline system; it also serves as the initial model for the discriminative training presented in Section III. A GMM-based gender classifier is trained similarly and is applied prior to dialect classification. Fig. 3 shows the block diagram of the unsupervised GMM-based dialect classification system. We describe the silence remover and feature extraction steps in the experimental section.

Fig. 3. GMM-based unsupervised dialect classification system.

There have been several successful techniques for unsupervised language identification which might be applicable to unsupervised dialect classification. For example, a number of popular methods are based on phone recognition, such as single-language phone recognition followed by language-dependent language modeling (PRLM), parallel PRLM, and language-dependent parallel phone recognition (PPR) [16], [31]. The basic idea is to analyze the phoneme sequence of the audio: the phoneme sequence is obtained through one phoneme recognizer or a parallel set of phoneme recognizers trained with outside data, and differences among phoneme sequences are captured through n-gram language models. The focus of our study is to develop better estimation algorithms than MLE for unsupervised dialect classification; therefore, the GMM trained with the MLE algorithm is considered the baseline system in our study.
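For illustration, the following is a minimal sketch of this baseline, assuming scikit-learn's `GaussianMixture` as a stand-in for the EM/MLE training described above; the data layout (one frame matrix per dialect) and all function names are illustrative, not from the original system.

```python
# Minimal sketch of the baseline MLE-GMM dialect classifier, assuming
# silence removal and feature extraction already produced one
# (num_frames x num_dims) array per dialect or utterance.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_dialect_gmms(train_data, n_mixtures=600):
    """train_data: dict mapping dialect label -> (frames x dims) array."""
    models = {}
    for dialect, frames in train_data.items():
        gmm = GaussianMixture(n_components=n_mixtures,
                              covariance_type="diag",  # diagonal covariances, as in the paper
                              max_iter=100, random_state=0)
        gmm.fit(frames)                                # EM-based MLE training
        models[dialect] = gmm
    return models

def classify(models, utterance):
    """Pick the dialect whose GMM gives the highest per-frame log-likelihood."""
    return max(models, key=lambda d: models[d].score(utterance))
```

A gender classifier of the same form can be trained and applied before `classify`, mirroring Fig. 3.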
III. DISCRIMINATIVE TRAINING FOR UNSUPERVISED DIALECT CLASSIFICATION

As discussed in the introduction, some acoustic events carry dialect-dependent information, while other acoustic events might distract or even confuse the dialect classification procedure. Here, we view the Gaussian mixtures in the GMM as corresponding to the acoustic events in the audio, as suggested in [25]. We assume that some mixtures correspond to acoustic events which contribute to dialect classification, and other mixtures correspond to acoustic events which distract the model from dialect classification. Fig. 4(a) represents the GMMs trained using the MLE method discussed in Section II. The overlapped regions of the two GMMs represent the regions (Gaussian mixtures) which distract the model from dialect classification.

The important question is how to deal with the confused regions in the models, which is addressed with several techniques below. The first technique removes the confused region so that the model includes only dialect-dependent (discriminative) mixtures [see Fig. 4(b)]. This technique is called mixture selection (MS) and is proposed in Section III-A. The second technique averages out the confused region, so that the confused region maps the confused acoustic events but does not impact the classification procedure [see Fig. 4(c)]. This technique is implemented through frame selection (FS) in the data space and is proposed in Section III-B. A common way to deal with the confused region is to separate the models as much as possible from each other so that the confused region is reduced [see Fig. 4(d)]. The well-known MCE criterion [17], [18] is such a technique and is reviewed in Section III-C. Another interesting discriminative training algorithm for GMMs is maximum mutual information (MMI) estimation [22], [23], which has been applied successfully to language identification [1], [21]. MMI training can also reduce the confused region, as MCE training does, and therefore could be applied here as well.

Fig. 4. Relationship among MLE-GMM training, MS-GMM training, FS-GMM training, and MCE-GMM training. (a) MLE-GMM. (b) MS-GMM. (c) FS-GMM. (d) MCE-GMM. (e) MCE-MS-GMM. (f) MCE-FS-GMM.

Fig. 5. Discriminative GMM training based on Gaussian mixture selection (MS-GMM).

A. Gaussian Mixture Selection on GMM Training (MS-GMM)

The Gaussian mixtures represent the acoustic space of the training data. We expect some Gaussian mixtures to represent dialect-dependent acoustic characteristics, and others to be less dialect-dependent, possibly causing confusion for dialect classification. In this section, we formulate a scheme to detect the most dialect-dependent Gaussian mixtures and sort the mixtures according to their discriminating abilities. The new GMM is then obtained by selecting the top discriminative mixtures. Fig. 5 shows the block diagram of Gaussian mixture selection for GMM training. The starting GMMs are the baseline GMMs trained using MLE, as described in Section II.

Let $\lambda_i$ be the GMM of dialect $i$, and let $\lambda_{ij}$ be the $j$th Gaussian component of $\lambda_i$, where $1 \le i \le D$, $1 \le j \le M$, $D$ is the number of dialects, and $M$ is the number of mixtures in each dialect GMM. While the number of mixtures can differ across dialects, we use the same number of mixtures for each dialect GMM in our study. For the $j$th Gaussian component in the $i$th GMM, $w_{ij}$ is the Gaussian component weight, and $\mathcal{N}(\mu_{ij}, \Sigma_{ij})$ is the corresponding Gaussian distribution with mean vector $\mu_{ij}$ and associated diagonal covariance matrix $\Sigma_{ij}$. The number of speech frames in the training data for dialect $i$ is $n_i$, and the total number of speech frames in the training data is $n$. The discriminating ability of the Gaussian component $\lambda_{ij}$ is defined as

$$D(\lambda_{ij}) = \sum_{t=1}^{n} \beta_{ij}(x_t)\,\hat{p}(x_t \mid \lambda_{ij}) \tag{1}$$

where $\hat{p}(x_t \mid \lambda_{ij})$ is the weighted density of Gaussian component $\lambda_{ij}$ generating speech frame $x_t$, which is defined as

$$\hat{p}(x_t \mid \lambda_{ij}) = w_{ij}\,\mathcal{N}(x_t; \mu_{ij}, \Sigma_{ij}). \tag{2}$$

Here, the weight term $\beta_{ij}(x_t)$ in (1) is defined as

$$\beta_{ij}(x_t) = \begin{cases} \dfrac{n - n_i}{n}, & \text{if } \lambda_{ij} = \arg\max_{k,l}\, \hat{p}(x_t \mid \lambda_{kl}) \text{ and } x_t \in \text{class } i \\[4pt] -\dfrac{n_i}{n}, & \text{if } \lambda_{ij} = \arg\max_{k,l}\, \hat{p}(x_t \mid \lambda_{kl}) \text{ and } x_t \notin \text{class } i \\[4pt] 0, & \text{otherwise} \end{cases} \tag{3}$$

where $d$ is the number of dimensions of the feature vector and the Gaussian density is

$$\mathcal{N}(x_t; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x_t - \mu)^{\mathsf T}\Sigma^{-1}(x_t - \mu)\right\}. \tag{4}$$

The larger the value of $D(\lambda_{ij})$, the larger the discriminating ability of the $j$th Gaussian component in the $i$th dialect GMM. For each GMM, the mixtures are sorted based on this discriminating ability measure.
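The scoring and selection steps (including the weight renormalization described next) can be sketched as follows. This is our reading of (1)-(3), restricted to the non-negative (normalized) credit assignment; each frame is assigned to its single best-matching component across all dialect GMMs, and a component is credited by frames of its own dialect. `train_dialect_gmms` is from the earlier sketch, and all helper names are ours.

```python
# Sketch of mixture scoring and selection (MS-GMM); normalized-score credit
# only (the raw variants would also subtract credit from competing-dialect
# frames, per the negative branch of (3)).
import numpy as np
from scipy.stats import multivariate_normal

def weighted_densities(gmm, frames):
    """p_hat(x_t | lambda_ij) = w_ij * N(x_t; mu_ij, Sigma_ij), per (2)."""
    return np.stack([w * multivariate_normal.pdf(frames, mean=mu, cov=np.diag(var))
                     for w, mu, var in zip(gmm.weights_, gmm.means_, gmm.covariances_)],
                    axis=1)                                   # (frames x components)

def nps_scores(models, train_data):
    """Per-component discriminability scores accumulated over the training data."""
    labels = list(models)
    M = models[labels[0]].n_components
    scores = {d: np.zeros(M) for d in labels}
    for true_d, frames in train_data.items():
        dens = np.concatenate([weighted_densities(models[d], frames) for d in labels],
                              axis=1)
        best = dens.argmax(axis=1)            # globally best component per frame
        for t, b in enumerate(best):
            d, j = labels[b // M], b % M
            if d == true_d:                   # frame supports its own dialect's component
                scores[d][j] += dens[t, b]    # probability score; use += 1 for frame count
    return scores

def select_top(gmm, comp_scores, keep_frac=0.75):
    """Keep the most discriminative components and renormalize the weights."""
    keep = np.argsort(comp_scores)[::-1][:int(keep_frac * len(comp_scores))]
    gmm.means_ = gmm.means_[keep]
    gmm.covariances_ = gmm.covariances_[keep]
    gmm.precisions_cholesky_ = gmm.precisions_cholesky_[keep]  # keep sklearn scoring consistent
    gmm.weights_ = gmm.weights_[keep] / gmm.weights_[keep].sum()
    gmm.n_components = len(keep)
    return gmm
```

With `keep_frac = 0.75`, this corresponds to the top-75% selection examined in Section IV-B.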
The new GMM is formulated by selecting the top discriminative mixtures from the prior GMM, and the weights are recalculated to ensure that the mixture weights sum to one in the new GMM. As shown in Fig. 4(b), the confused region in the acoustic model space is thereby excluded. The evaluation process is performed in exactly the same manner as in the baseline GMM classification system.

In our study, we formulate four variations of the above scheme. If we remove the probability term $\hat{p}(x_t \mid \lambda_{ij})$ on the right-hand side of (1), $D(\lambda_{ij})$ becomes the discriminative speech frame count for the $j$th Gaussian component in the $i$th dialect GMM; we refer to the original $D(\lambda_{ij})$ as the probability score. The frame count and probability score calculated as above are the raw values. If we remove the negative branch of (3), the calculated frame count and probability score are referred to as the normalized values; in this case, $\beta_{ij}(x_t)$ is always non-negative. In summary, the Gaussian mixture discriminating ability is measured with four different scores: the raw frame count (RFC), the normalized frame count (NFC), the raw probability score (RPS), and the normalized probability score (NPS). These four scores are different methods of calculating the discriminating ability, not scale factors on the calculation; therefore, the number of mixtures in a GMM has little impact on the choice of measure. In Section IV-B, we describe how the measure for MS-GMM is selected.

B. Frame Selection on GMM Training (FS-GMM)

In addition to extracting the dialect-dependent information from the acoustic space represented by the Gaussian mixtures, we can extract the dialect-dependent discriminative information from the training data directly: we remove the most confusing speech frames in the training data and train new GMMs with the remaining data (this algorithm is termed FS-GMM). A speech frame $x_t$ from dialect $i$ is called a confusing frame if

$$i^{*} = \arg\max_{1 \le k \le D}\, p(x_t \mid \lambda_k) \quad \text{and} \quad i^{*} \neq i \tag{5}$$

where $p(x_t \mid \lambda_k) = \sum_{j=1}^{M} \hat{p}(x_t \mid \lambda_{kj})$. When the number of consecutive confusing frames exceeds a predefined threshold (0.1 s in our study, which represents ten acoustic frames), we regard these frames as "garbage" frames. After removing the garbage frames, a new GMM (i.e., a discriminative GMM) is trained for each dialect using the remaining (i.e., discriminative) speech frames of that dialect. The garbage frames from all dialects are grouped together, and a single garbage GMM is trained. The final GMM system is obtained by combining the discriminative GMM with the garbage GMM. A prior probability governs this model combination, determining how much weight is assigned to the discriminative GMM versus the garbage GMM; this prior can only be determined empirically. In the experimental section, the performance of FS-GMM with different prior probabilities is presented. Fig. 6 shows the block diagram of GMM training based on frame selection.

Fig. 6. Discriminative GMM training based on frame selection (FS-GMM).

In the combined model, the Gaussian mixtures from the garbage model neither help discriminate the dialect classes nor distract the classification, but they contribute to mapping the confused acoustic events. The dialect-dependent acoustic events are mapped to the Gaussian mixtures from the discriminative model. As shown in Fig. 4(c), the confused regions in the models are therefore balanced and included.
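A sketch of frame selection follows, under two assumptions of ours: the run-length rule is one reading of the consecutive-frame criterion above, and the combination rule treats the discriminative and garbage GMMs as a two-component mixture weighted by the prior. `score_samples` is scikit-learn's per-frame log-likelihood.

```python
# Sketch of frame selection (FS-GMM). A frame is confusing per (5) when its
# best-scoring dialect GMM disagrees with its dialect label; runs of at least
# ten such frames (0.1 s at 100 frames/s) are treated as garbage.
import numpy as np

def split_frames(models, frames, true_dialect, min_run=10):
    labels = list(models)
    ll = np.stack([models[d].score_samples(frames) for d in labels])  # (dialects x frames)
    confusing = np.array(labels)[ll.argmax(axis=0)] != true_dialect
    garbage = np.zeros(len(frames), dtype=bool)
    run_start = None
    for t, c in enumerate(np.append(confusing, False)):   # sentinel closes a final run
        if c and run_start is None:
            run_start = t
        elif not c and run_start is not None:
            if t - run_start >= min_run:
                garbage[run_start:t] = True
            run_start = None
    return frames[~garbage], frames[garbage]

def fs_score(disc_gmm, garbage_gmm, utterance, prior=0.6):
    """Frame log-likelihood under the prior-weighted two-model mixture."""
    ll = np.logaddexp(np.log(prior) + disc_gmm.score_samples(utterance),
                      np.log(1.0 - prior) + garbage_gmm.score_samples(utterance))
    return ll.mean()
```

In this sketch, each dialect's discriminative GMM is retrained on `frames[~garbage]`, the garbage frames pooled over all dialects train one shared garbage GMM, and classification picks the dialect maximizing `fs_score`.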
C. Minimum Classification Error (MCE) on GMM Training

A common way to deal with the confused region in the model is to reduce it by separating the models; MCE training is suitable in this case [4], [17], [18]. MCE training separates the models as far as possible so that the confused region is reduced [see Fig. 4(d)]. Let us define the discriminant function as the log-likelihood function of the GMM:

$$g_i(X; \Lambda) = \log p(X \mid \lambda_i) = \sum_{t=1}^{T} \log \sum_{j=1}^{M} \hat{p}(x_t \mid \lambda_{ij}). \tag{6}$$

Here, $X = (x_1, \dots, x_T)$ is the observation sequence, $\lambda_i$ is the GMM of the $i$th dialect class, the number of dialect classes is $D$, $M$ is the number of mixtures in the GMM, $x_t$ is the observation at time $t$, $T$ is the total number of observations, and $\hat{p}(x_t \mid \lambda_{ij})$ is the weighted density of Gaussian component $\lambda_{ij}$ generating speech frame $x_t$, as defined in (2). The misclassification measure is defined as

$$d_i(X) = -g_i(X; \Lambda) + g_{i'}(X; \Lambda), \qquad i' = \arg\max_{k \neq i}\, g_k(X; \Lambda) \tag{7}$$

where the anti-discriminant function $g_{i'}(X; \Lambda)$ is the log-likelihood of the most competing class. The misclassification measure is transformed into a smooth loss function by using a sigmoid approximation of the zero-one loss, defined as

$$\ell_i(X; \Lambda) = \frac{1}{1 + \exp(-\gamma\, d_i(X))} \tag{8}$$

where $\gamma$ is a slope coefficient of the sigmoid function, which controls the steepness of the sigmoid and is a positive number; the larger the value of $\gamma$, the steeper the sigmoid. When $d_i(X)$ is a large negative number, which indicates correct recognition, the loss function $\ell_i(X; \Lambda)$ has a value close to zero, implying that no loss was incurred. On the other hand, when $d_i(X)$ is a positive number, the loss function has a value between 0.5 and 1, indicating the occurrence of an error. The overall empirical loss associated with the given observation sequences is

$$L(\Lambda) = \sum_{i=1}^{D} \sum_{X} \ell_i(X; \Lambda)\, \mathbb{1}(X \in C_i) \tag{9}$$

where $\mathbb{1}(X \in C_i)$ is a Boolean function which returns 1 if $X$ is from class $i$ and 0 otherwise. Thus, the training objective is to minimize the expected loss

$$\Lambda^{*} = \arg\min_{\Lambda} E\left[L(\Lambda)\right]. \tag{10}$$

The parameter set $\Lambda$ is estimated by first choosing an initial estimate (an MLE-trained GMM in our study) and iteratively adapting the parameters (including the means, covariances, and mixture weights of the GMMs) using generalized probabilistic descent (GPD) [3], [18] according to

$$\Lambda_{n+1} = \Lambda_n - \epsilon_n \nabla L(\Lambda_n) \tag{11}$$

where $\epsilon_n$ is the learning rate, $\nabla L(\Lambda_n)$ is the gradient of the loss function, and $n$ is the iteration count. This represents the MCE-GMM training scheme for dialect classification in Fig. 4(d).
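To make (6)-(11) concrete, the following sketches a single GPD step, restricted to the Gaussian means for brevity (the study also updates weights and variances); the gradient chains the sigmoid slope through (7) and the component posteriors of (6). The means-only restriction and all names are ours.

```python
# Sketch of one MCE/GPD update on the Gaussian means, following (6)-(8) and (11).
import numpy as np
from scipy.special import logsumexp, expit

def weighted_log_prob(gmm, X):
    """log w_ij + log N(x_t; mu_ij, Sigma_ij) for every frame t and component j."""
    var = gmm.covariances_                           # (M x d) diagonal variances
    diff = X[:, None, :] - gmm.means_[None]          # (T x M x d)
    log_n = -0.5 * (np.sum(diff ** 2 / var, axis=2)
                    + np.sum(np.log(2 * np.pi * var), axis=1))
    return np.log(gmm.weights_) + log_n              # (T x M)

def mce_step(models, utterance, true_d, gamma=100.0, lr=0.5):
    # Discriminant g_i(X) = sum_t log sum_j p_hat(x_t | lambda_ij), per (6).
    lps = {d: weighted_log_prob(m, utterance) for d, m in models.items()}
    g = {d: logsumexp(lp, axis=1).sum() for d, lp in lps.items()}
    comp = max((d for d in models if d != true_d), key=lambda d: g[d])  # most competing class
    d_val = -g[true_d] + g[comp]                     # misclassification measure (7)
    loss = expit(gamma * d_val)                      # sigmoid loss (8)
    scale = gamma * loss * (1.0 - loss)              # d(loss)/d(d_val)
    # Gradient step (11): move the true model toward the frames, the competitor away.
    for dialect, sign in ((true_d, -1.0), (comp, +1.0)):
        gmm, lp = models[dialect], lps[dialect]
        post = np.exp(lp - logsumexp(lp, axis=1, keepdims=True))  # component posteriors
        grad_mu = (post[:, :, None] *
                   (utterance[:, None, :] - gmm.means_[None]) / gmm.covariances_).sum(axis=0)
        gmm.means_ -= lr * scale * sign * grad_mu
```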
D. MS/FS After MCE on GMM Training

MCE is a technique that separates the acoustic spaces represented by the GMMs as far as possible under the minimum classification error criterion. It changes all the parameter values in the GMMs (means, covariances, and mixture weights), but this does not mean that a partial list of Gaussian mixtures in the MCE-trained GMM is sufficient for classification. In fact, the whole set of Gaussian mixtures in the MCE-trained GMM should be present for classification, since this is the domain in which the MCE criterion was applied.

MCE training can reduce the confused region, as shown in Fig. 4(d), by pulling the models away from each other. However, the confused regions are still present. Therefore, MS or FS training can be applied after MCE training. Since MS or FS training removes or balances the confused regions, applying MCE after MS or FS is not favorable.

If MS or FS training is applied after MCE training, the excluded or shared region in the new GMMs will be smaller than when applying MS or FS training directly to the original models: in Fig. 4(e) and (f), the excluded/shared region is smaller than the excluded/shared region in Fig. 4(b) and (c). Since the model structures in Fig. 4(e) and (f) are changed less, we expect that applying MS or FS after MCE training (i.e., MCE-MS or MCE-FS) will have less impact on the models than directly applying MS or FS, so that MCE-MS or MCE-FS training can achieve better performance than MS or FS training alone. Naturally, applying MS or FS after MCE will usually yield less additional improvement over MCE than applying MS or FS directly to the ML-trained models, since MCE training has already increased the model discriminability to some extent. These arguments are investigated in Section IV.

IV. EXPERIMENTS

In this section, the proposed algorithms are evaluated using dialect data from English and Spanish. There are no associated transcripts for the training or test data in any language or dialect. The English and Spanish data are spontaneous speech (i.e., one person from the dialect region talking spontaneously). The features used in our study are the well-known Mel-frequency cepstral coefficients (MFCCs) $c_1$-$c_{12}$: the log energy and $c_1$-$c_{12}$, plus their delta coefficients, are used, so that in total 26 coefficients represent each frame. The window size for a frame is 20 ms, with a skip rate of 10 ms; the frame rate is therefore 100 frames per second.

The static training scheme is applied for MLE, MS, FS, and MCE GMM training (i.e., all the training data is presented to the system at training time). In the evaluation, the large test audio files are partitioned into small utterances. The test utterance length is 10 s in duration unless otherwise specified. The final classification performance is the average over all test utterances.

TABLE I. THREE-DIALECT SPANISH CORPUS USED IN OUR STUDY

A. Dialect Corpora: Spanish and English

The first corpus used in our study consists of Latin American Spanish dialect speech from Cuba, Peru, and Puerto Rico, which is partially described in [31]. The spontaneous speech portion was recorded in an interview style: the interviewer gave sample topics such as "describe your family," and the subject would respond; the interviewer would also give hints to keep the subject talking. The subject used a head-mounted microphone, which also captured the speech from the interviewer at a much lower amplitude, since the interviewer sat far from the microphone. The speech from both the interviewer and the subject was recorded on the same channel, and there were also long periods of silence in the audio. To address these issues, we employ a silence remover, based on an overall energy measure, to eliminate long silences and potential speech from the interviewer. Table I summarizes the training and test data after the silence removal process; PR denotes the dialect from Puerto Rico. The speakers used for training and testing are roughly balanced between male and female. Since the size of the corpus is limited, we do not set aside data for development use; all of the data is used either in training or in test. We illustrate several combinations of parameters in the experimental results; in an actual application, a development data set could be used to select the best parameters.

The second corpus used is IViE [28], which consists of eight British English dialects. We select three dialects for use in our study based on geographical origins, since we have three dialects for Spanish.
These dialects are from Belfast (Northern Ireland), Cambridge (England), and Cardiff (Wales). For each dialect, there are four male and four female speakers for training, and two male and two female speakers for testing. The training data is 57 min in total duration; the test data is 30 min in total duration. The IViE corpus is therefore a relatively small corpus.

B. Probe Experiments

For the proposed algorithms, several parameter settings are required before formal evaluation. These parameters include the discriminative measure of Section III-A, the range for mixture selection, the sigmoid steepness coefficient $\gamma$, the learning rate in MCE training, and the number of mixtures for GMM training. Since the IViE corpus is too small to be representative, the Spanish data is used for the probe experiments, with the resulting settings held constant for both the English and Spanish corpora.

Fig. 7. Gaussian mixture sorting using the four discriminative measures. (a) Based on raw frame count (RFC). (b) Based on normalized frame count (NFC). (c) Based on raw probability score (RPS). (d) Based on normalized probability score (NPS).

Fig. 8. Classification accuracy of MS-GMM when selecting different percentages of discriminative mixtures. X-axis: percentage of discriminative mixtures selected (from 10% to 100%; when 100% of the mixtures are selected, the system is the baseline model). Y-axis: classification accuracy (%).

The first experiment determines which discriminative measure of Section III-A sorts the mixtures consistently for both training and unseen test data, so that the most confusing mixtures can be excluded and a new set of GMMs generated. Fig. 7 shows mixture sorting using the four discriminative measures described in Section III-A (RFC, NFC, RPS, and NPS). The points in the figures are the numerical indices of the mixtures; different point shapes denote different dialect classes, and the axes represent the mixture indices. We use 200 mixtures for the baseline GMM training in Figs. 7 and 8 and Table II.

TABLE II. CLASSIFICATION ACCURACY OF THE FOUR MS-GMM SCHEMES; HERE THE TOP 75% OF THE MIXTURES ARE SELECTED

Fig. 7 is generated by first using the training data as the input to sort the mixtures of the baseline GMMs and obtain new GMMs, then using the test data as the input to sort the mixtures of the new GMMs; the resulting sequences of mixtures are drawn in the figure. If the top discriminative mixtures in the training data are also the top discriminative mixtures in the test data, the points in the figure will cluster near the line $y = x$. If the top discriminative mixtures in the training data are the least discriminative mixtures in the test data, the points will cluster along the anti-diagonal. The ideal case is for all points to fall on the line $y = x$. From Fig. 7, we observe that mixture sorting based on the normalized probability score identifies the top discriminative mixtures in a consistent manner for both training and test data. It also shows that development data is not necessary for mixture sorting, since the Gaussian mixtures can be sorted using the training data of the original GMMs. Table II shows the classification accuracy of the mixture-selected GMM (MS-GMM) with the four discriminative measures, selecting the top 75% of the mixtures for the new GMMs. We observe that all four mixture selection schemes improve dialect classification accuracy.
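A compact numerical stand-in for the visual check in Fig. 7: if training and test data rank the mixtures consistently, the Spearman correlation between the two discriminability orderings approaches +1 (points near the line $y = x$), while a value near -1 indicates reversed orderings. This diagnostic is our addition, not part of the original study; `nps_scores` is the earlier sketch.

```python
# Rank-consistency check between training- and test-data mixture orderings.
from scipy.stats import spearmanr

def sorting_consistency(models, train_data, test_data):
    train_scores = nps_scores(models, train_data)
    test_scores = nps_scores(models, test_data)
    return {d: spearmanr(train_scores[d], test_scores[d]).correlation
            for d in models}
```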
The normalized probability score is the best scheme for sorting the mixtures, and we use this scheme for the following experiments.

Next, we show how the percentage of selected mixtures affects classification performance. Fig. 8 shows the classification accuracy of MS-GMM formulated by selecting different percentages of the top mixtures. When the top 100% of the mixtures are selected, the system reduces to the baseline GMM. From Fig. 8, we observe that: 1) at least 50% of the original mixtures must be retained, and 2) classification performance improves when the least discriminative mixtures (i.e., the most confusing mixtures) are removed.

We also train baseline GMMs with 300, 600, and 1000 mixtures, and select the top 50% to 100% of the mixtures based on the normalized probability score measure. The results are shown in Fig. 9. When 100% of the Gaussian mixtures are selected, the system reduces to the baseline GMM. The classification accuracy of the baseline 300-, 600-, and 1000-mixture GMMs is 73.7%, 74.3%, and 73.7%, respectively, as shown in Fig. 9 (i.e., the case where the percentage of mixtures selected is 100%). We observe several interesting results: 1) when selecting 50% to 95% of the sorted mixtures, the new GMMs outperform the baseline GMMs, and 2) the more mixtures in the baseline GMMs, the smaller the portion of sorted mixtures that needs to be retained to obtain the best GMMs. In Fig. 9, when 65% of the mixtures are selected for the 1000-mixture baseline GMMs, the new GMMs achieve the best performance; for the 600-mixture and 300-mixture GMMs, the best performance is achieved when 75% and 80% of the mixtures are selected, respectively.

Fig. 9. Classification accuracy of MS-GMM when selecting different percentages of discriminative mixtures. The initial GMM has 300, 600, and 1000 mixtures, respectively. X-axis: percentage of discriminative mixtures selected (from 50% to 100%; when 100% of the mixtures are selected, the system is the baseline model). Y-axis: classification accuracy (%).

TABLE III. SPANISH DIALECT CLASSIFICATION PERFORMANCE OF THE MCE-TRAINED GMM CLASSIFIER USING DIFFERENT $\gamma$ VALUES IN THE SIGMOID FUNCTION

Several parameter settings are required for MCE training. Based on previous experiments [4], [17], [18] and our experience, the learning rate for MCE training should be less than one and scaled down gradually. We set the initial learning rate to 0.5, and the learning rate is scaled down by 0.8 after every third iteration. We determined that overall performance is sensitive to the sigmoid steepness coefficient $\gamma$ in (8). Table III shows dialect classification performance when different values of $\gamma$ are used in MCE training; the 200-mixture GMM is used in this experiment, and the baseline classification accuracy is 73.5%, as shown in Table II. The initial learning rate is 0.5 and the scale-down factor is 0.8. From Table III, we find that $\gamma = 100$ is a good value for MCE training. It is also observed that updating the mixture variances in the GMMs is useful for dialect classification. Therefore, the final parameters for MCE training in our study are as follows: $\gamma$ is 100, the initial learning rate is 0.5, the scale-down factor is 0.8, the number of iterations is set to 10, and variance updating is employed.
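This schedule can be written out as follows, reusing the `mce_step` sketch from Section III-C; chunking the training frames into fixed-size pseudo-utterances is our assumption, since the utterance partitioning used during MCE training is not spelled out above.

```python
# Sketch of the MCE training schedule: initial learning rate 0.5, scaled by
# 0.8 after every third iteration, 10 iterations, gamma = 100.
import numpy as np

def mce_train(models, train_data, n_iter=10, lr0=0.5, decay=0.8, gamma=100.0):
    lr = lr0
    for it in range(n_iter):
        for true_d, frames in train_data.items():
            # ~1000-frame (10 s) chunks as pseudo-utterances -- an assumption.
            for utt in np.array_split(frames, max(1, len(frames) // 1000)):
                mce_step(models, utt, true_d, gamma=gamma, lr=lr)
        if (it + 1) % 3 == 0:
            lr *= decay          # scale down after every third iteration
    return models
```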
The last probe experiment tests how well the FS technique improves the GMM-based classifier. The FS-based GMM (FS-GMM) training method produces a discriminative model and a garbage model, and the final model is obtained by combining the two with predefined prior probabilities. We again use 300, 600, and 1000 mixtures for the baseline GMMs in this experiment. As shown in Fig. 9, the 300-, 600-, and 1000-mixture MLE-GMM-based classifiers achieve 73.7%, 74.3%, and 73.7% accuracy, respectively. For the newly trained GMMs, the discriminative model has the same number of mixtures as the baseline GMMs, and the garbage model has one-third the number of mixtures of the baseline GMMs. Fig. 10 shows the Spanish dialect classification accuracy of the combined models generated with different prior probabilities.

Fig. 10. Spanish dialect classification accuracy of FS-GMM with different prior probabilities for the discriminative model and garbage model. The initial GMM has 300, 600, and 1000 mixtures, respectively. X-axis: prior probability $x$ for the discriminative model in the combination (from $x = 0.05$ to 1); the prior probability for the garbage model is $1 - x$. Y-axis: classification accuracy (%).

From Fig. 10, we observe several interesting results: 1) the discriminative model alone achieves only marginal improvement (i.e., the case where the prior probability is 1 for the discriminative model in Fig. 10), and 2) the "garbage" model does help dialect classification.

Fig. 11. Classification accuracy of the proposed algorithms on the English [(a) and (b)] and Spanish [(c) and (d)] dialect corpora. Evaluated algorithms include MLE-GMM, MS-GMM, FS-GMM, MCE-GMM, MCE-MS-GMM, and MCE-FS-GMM.

TABLE IV. CLASSIFICATION ACCURACY (%) OF PROPOSED ALGORITHMS ON THE TWO CORPORA (SUMMARIZED VERSION OF FIG. 11)

C. Evaluation on English and Spanish Dialect Corpora

Having established the parameter settings for MS-GMM, FS-GMM, and MCE-GMM, we now turn to formal evaluation on the two dialect corpora: English and Spanish. From Section IV-B, the number of mixtures in the GMM does not significantly impact performance if the value is in a reasonable range (say, from 300 to 1000); we use 600 mixtures in the GMMs for the evaluation here. The settings for MCE are the same as in Section IV-B.

Fig. 11 shows the classification accuracy of the proposed algorithms on the two dialect corpora; Fig. 11(a) and (b) uses the English dialects, while Fig. 11(c) and (d) uses the Spanish dialects. All six algorithms shown in Fig. 4 and Sections II and III are evaluated (MLE-GMM, MS-GMM, FS-GMM, MCE-GMM, MCE-MS-GMM, and MCE-FS-GMM). In Fig. 11, MS-GMM means classifying using the mixture selection (MS) trained GMMs (see Fig. 4(b) and Section III-A); FS-GMM means classifying using the frame selection (FS) trained GMMs (see Fig. 4(c) and Section III-B); MCE-MS-GMM means MS applied to the MCE-trained GMMs (see Fig. 4(e) and Section III-D); and MCE-FS-GMM means FS applied to the MCE-trained GMMs (see Fig. 4(f) and Section III-D). The X-axis of the left sub-figures is the percentage of top discriminative mixtures selected (from 50% to 100%); when 100% of the mixtures are selected, MS-GMM degenerates to the baseline GMM (i.e., MLE-GMM; see Fig. 4(a) and Section II), and MCE-MS-GMM degenerates to MCE-GMM (see Fig. 4(d) and Section III-C). The X-axis of the right sub-figures is the prior probability $x$ for the discriminative model in the combination (from $x = 0.05$ to 1); the prior probability for the garbage model is $1 - x$. The Y-axis of all sub-figures is classification accuracy (%).
From Fig. 11, we find that MS-GMM and FS-GMM achieve significant improvement in dialect classification. MCE training also improves classification performance, and when the MS or FS techniques are applied after MCE training, further improvement is achieved. Table IV summarizes the best performance from Fig. 11 for MS-GMM, FS-GMM, MCE-MS-GMM, and MCE-FS-GMM. From Table IV, the best systems achieve 43.7% and 16.4% relative error reduction versus the baseline MLE-GMM system on the English and Spanish dialect corpora, respectively.

D. Comparison With the Human Listener Test

It is interesting to assess how well machine performance compares with human performance. In our prior research, Ikeno and Hansen [15] conducted a listener test using the same three British dialects from the IViE corpus. They recruited 11 listeners from the United Kingdom (U.K.), 11 listeners from the United States (U.S.), and 11 listeners from non-English-speaking countries. All listeners were either college students or faculty, and therefore highly educated, with potentially some exposure to British dialects, especially those listeners originally from the U.K. An approximately 60-s audio file per dialect was provided for listener training and reference; these audio files were always available for reference during testing. Each test utterance had nine words on average and was 3 s in duration. From [15], the listeners from the U.K. perform much better than listeners from the U.S. or non-English-speaking countries: the classification accuracy of British, American, and non-native listeners is 82.3%, 55.7%, and 45.0%, respectively.

TABLE V. FS-GMM SYSTEM VERSUS LISTENERS FROM THE U.K. ON THE THREE BRITISH DIALECTS

We evaluate the proposed automatic dialect classification system using the same 3-s test utterances for comparison. Table V compares the listeners from the U.K., the MLE-GMM (i.e., the baseline GMM system), and the proposed FS-GMM system. The number of mixtures for the GMM is held at 600. The performance of FS-GMM listed in Table V is achieved by setting the prior probability of the discriminative GMM between 0.5 and 0.7. From Table V, we observe that the MLE-trained GMM-based classifier achieves performance comparable to the human listeners, with British listeners having more trouble distinguishing the Cardiff dialect and MLE-GMM having more difficulty with the Belfast dialect. The proposed FS-GMM system clearly outperforms both the human listeners (7.4% absolute improvement) and the MLE-GMM system (6.8% absolute improvement), with consistent improvement for Belfast versus MLE-GMM and for Cardiff versus the British listeners. This suggests that the difficulty of Cardiff dialect perception for British listeners and the modeling limitations of MLE-GMM for the Belfast dialect are simultaneously addressed by FS-GMM.

V. CONCLUSION

Speech produced under different dialects of a language is represented by a range of acoustic events. These acoustic events can be separated into dialect-discriminating content and dialect-neutral/distractive content. In the Gaussian mixture model space, the dialect-distractive content forms the confused region of the models, which can be excluded, included, or reduced. Three corresponding techniques were developed in this study: mixture selection (MS), frame selection (FS), and MCE (i.e., MS-GMM, FS-GMM, and MCE-GMM).
The MLE-trained GMM-based system, representing the baseline, is easy to implement and achieves reasonable performance (see Section IV-D). The GMM trained with MS or FS achieves measurable performance improvement over the MLE-trained GMM system. The MCE-trained system achieves performance similar to the MS- or FS-trained systems; however, MS or FS training has a lower computational load than MCE training, making MS or FS training the better alternative for GMM-based dialect classification when fast training is needed. At test time, the computational load for MS, FS, and MCE is almost the same. Interestingly, MS or FS training and MCE training can be performed sequentially, resulting in further improvement. Evaluated on the English and Spanish dialects, the proposed MCE-MS-GMM and MCE-FS-GMM methods achieve 43.7% and 16.4% relative error reduction versus the MLE-GMM system on the English and Spanish corpora, respectively. The FS-GMM system achieves 41.8% relative error reduction compared with the (knowledgeable) human listeners in British dialect classification. Therefore, the proposed automatic dialect classification systems will be beneficial in real-world speech applications.

REFERENCES

[1] L. Burget, P. Matejka, and J. Cernocky, "Discriminative training techniques for acoustic language identification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Toulouse, France, 2006, pp. 209–212.
[2] D. L. Canfield, Spanish Pronunciation in the Americas. Chicago, IL: Univ. Chicago Press, 1981.
[3] E. W. Cheney, Introduction to Approximation Theory. New York: AMS/Chelsea, 1982.
[4] W. Chou, "Discriminant-function-based minimum recognition error rate pattern-recognition approach to speech recognition," Proc. IEEE, vol. 88, no. 8, pp. 1201–1223, Aug. 2000.
[5] D. Crystal, A Dictionary of Linguistics and Phonetics. Malden, MA: Blackwell, 1997.
[6] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc., vol. 39, pp. 1–38, 1977.
[7] V. Diakoloukas, V. Digalakis, L. Neumeyer, and J. Kaja, "Development of dialect-specific speech recognizers using adaptation methods," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Munich, Germany, Apr. 1997, vol. 2, pp. 1455–1458.
[8] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley, 2001.
[9] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression, 6th ed. Norwell, MA: Kluwer, 1991.
[10] V. Gupta and P. Mermelstein, "Effect of speaker accent on the performance of a speaker-independent isolated word recognizer," J. Acoust. Soc. Amer., vol. 71, pp. 1581–1587, 1982.
[11] C. Huang, T. Chen, S. Li, E. Chang, and J.-L. Zhou, "Analysis of speaker variability," in Proc. Eur. Conf. Speech Commun. Technol., Aalborg, Denmark, Sep. 2001, vol. 2, pp. 1377–1380.
[12] R. Huang and J. H. L. Hansen, "Advances in word based dialect/accent classification," in Proc. Interspeech-Eurospeech, Lisbon, Portugal, Sep. 2005, pp. 2241–2244.
[13] R. Huang and J. H. L. Hansen, "Dialect/accent classification via boosted word modeling," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Philadelphia, PA, Mar. 2005, vol. 1, pp. 585–588.
[14] J. J. Humphries and P. C. Woodland, "The use of accent-specific pronunciation dictionaries in acoustic model training," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Seattle, WA, May 1998, vol. 1, pp. 317–320.
[15] A. Ikeno and J. H. L. Hansen, "Perceptual recognition cues in native English accent variation: Listener accent, perceived accent, and comprehension," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Toulouse, France, May 2006, pp. I-401–I-404.
[16] A. S. Jayram, V. Ramasubramanian, and T. Sreenivas, "Language identification using parallel sub-word recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Hong Kong, China, Apr. 2003, vol. 1, pp. 32–35.
[17] B.-H. Juang, W. Chou, and C.-H. Lee, "Minimum classification error rate methods for speech recognition," IEEE Trans. Speech Audio Process., vol. 5, no. 3, pp. 257–265, May 1997.
[18] B.-H. Juang and S. Katagiri, "Discriminative learning for minimum error classification," IEEE Trans. Signal Process., vol. 40, no. 12, pp. 3043–3054, Dec. 1992.
[19] J. M. Lipski, Latin American Spanish. London, U.K.: Longman, 1994.
[20] M.-K. Liu, B. Xu, T.-Y. Huang, Y.-G. Deng, and C.-R. Li, "Mandarin accent adaptation based on context-independent/context-dependent pronunciation modeling," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Istanbul, Turkey, Jun. 2000, vol. 2, pp. 1025–1028.
[21] P. Matejka, L. Burget, P. Schwarz, and J. Cernocky, "NIST language recognition evaluation 2005," in Proc. NIST LRE 2005 Workshop, Washington, DC, 2006, pp. 1–37.
[22] A. Nadas, "A decision-theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-31, no. 4, pp. 814–817, Aug. 1983.
[23] D. Povey, "Discriminative training for large vocabulary speech recognition," Ph.D. dissertation, Cambridge Univ., Cambridge, U.K., 2004.
[24] D. Reynolds and R. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 72–83, Jan. 1995.
[25] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Dig. Signal Process., vol. 10, no. 1–3, pp. 19–41, Jan./Apr./Jul. 2000.
[26] P. A. Torres-Carrasquillo, T. P. Gleason, and D. A. Reynolds, "Dialect identification using Gaussian mixture models," in Proc. Odyssey: The Speaker and Language Recognition Workshop, 2004, pp. 297–300.
[27] W. Ward, H. Krech, X. Yu, K. Herold, G. Figgs, A. Ikeno, D. Jurafsky, and W. Byrne, "Lexicon adaptation for LVCSR: Speaker idiosyncracies, non-native speakers, and pronunciation choice," in Proc. ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation, Estes Park, CO, Sep. 2002.
[28] IViE: British Dialect Corpus [Online]. Available: http://www.phon.ox.ac.uk/esther/ivyweb/
[29] J. C. Wells, Accents of English. Cambridge, U.K.: Cambridge Univ. Press, 1982, vol. I, II, III.
[30] L. R. Yanguas, G. C. O'Leary, and M. A. Zissman, "Incorporating linguistic knowledge into automatic dialect identification of Spanish," in Proc. Int. Conf. Spoken Lang. Process., Sydney, Australia, Nov. 1998.
[31] M. A. Zissman, T. P. Gleason, D. M. Rekart, and B. L. Losiewicz, "Automatic dialect identification of extemporaneous conversational, Latin American Spanish speech," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1996, vol. 2, pp. 777–780.
[32] P. Angkititrakul and J. H. L. Hansen, "Advances in phone-based modeling for automatic accent classification," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 2, pp. 634–646, Mar. 2006.
Rongqing Huang (S'01–M'06) was born in China in 1979. He received the B.S. degree from the University of Science and Technology of China (USTC), Hefei, China, in 2002, and the M.S. and Ph.D. degrees from the University of Colorado at Boulder in 2004 and 2006, respectively, all in electrical engineering. In 2006, he joined the Acoustic Modeling Group, Dragon and Healthcare R&D, Nuance Communications, Burlington, MA, as a Research Scientist. He works on large vocabulary acoustic modeling for network, medical, and other Nuance speech recognition products. From 2000 to 2002, he worked in the USTC iFlyTek Speech Lab. From 2002 to 2005, he worked in the Robust Speech Processing Group, Center for Spoken Language Research, University of Colorado at Boulder, where he was a Ph.D. Research Assistant in the Department of Electrical and Computer Engineering. In 2005, he was a summer intern with Motorola Labs, Schaumburg, IL. He was a research intern in the Center for Robust Speech Systems at the University of Texas at Dallas from 2005 to 2006. His research interests include the general areas of large vocabulary speech recognition, machine learning, and robust signal processing.

John H. L. Hansen (S'81–M'82–SM'93–F'07) received the B.S.E.E. degree from the College of Engineering, Rutgers University, New Brunswick, NJ, in 1982 and the M.S. and Ph.D. degrees in electrical engineering from the Georgia Institute of Technology, Atlanta, in 1983 and 1988, respectively. He joined the Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas (UTD), Richardson, in the fall of 2005, where he is Professor and Department Chairman of Electrical Engineering and holds the Distinguished University Chair in Telecommunications Engineering. He also holds a joint appointment as Professor in the School of Brain and Behavioral Sciences (Speech and Hearing). At UTD, he established the Center for Robust Speech Systems (CRSS), which is part of the Human Language Technology Research Institute. In 1988, he established the Robust Speech Processing Laboratory (RSPL) and continues to direct research activities in the CRSS at UTD. Previously, he served as Department Chairman and Professor in the Department of Speech, Language, and Hearing Sciences (SLHS) and as Professor in the Department of Electrical and Computer Engineering, both at the University of Colorado at Boulder (1998–2005), where he cofounded the Center for Spoken Language Research. He has also served as a Technical Advisor to the U.S. Delegate for NATO (IST/TG-01). His research interests span the areas of digital speech processing, analysis and modeling of speech and speaker traits, speech enhancement, feature estimation in noise, robust speech recognition with emphasis on spoken document retrieval, and in-vehicle interactive systems for hands-free human–computer interaction. He has supervised 39 (18 Ph.D., 21 M.S.) thesis candidates and is the author/coauthor of 251 journal and conference papers in the field of speech processing and communications. He was the coauthor of the textbook Discrete-Time Processing of Speech Signals (IEEE Press, 2000), coeditor of DSP for In-Vehicle and Mobile Systems (Springer, 2004) and Advances for In-Vehicle and Mobile Systems: Challenges for International Standards (Springer, 2007), and lead author of the report "The Impact of Speech Under 'Stress' on Military Speech Technology" (NATO RTO-TR-10, 2000).
Dr. Hansen has served as the IEEE Signal Processing Society Distinguished Lecturer for 2005/2006, and as a Member of the IEEE Signal Processing Society Speech Technical Committee and Educational Technical Committee, the Speech Communications Technical Committee for the Acoustical Society of America (2000–2003), and the International Speech Communication Association (ISCA) Advisory Council (2004–2010). He has served as Associate Editor for the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING (1992–1999) and IEEE SIGNAL PROCESSING LETTERS (1998–2000), and as an Editorial Board Member for the IEEE Signal Processing Magazine (2001–2003). He has also served as Guest Editor of the October 1994 Special Issue on Robust Speech Recognition for the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING. He was the recipient of the 2005 University of Colorado Teacher Recognition Award as voted by the student body. He also organized and served as General Chair for ICSLP-2002: International Conference on Spoken Language Processing, and will serve as Technical Program Chair for IEEE ICASSP-2010, to be held in Dallas, TX.