A LINGUISTICALLY-INFORMATIVE APPROACH TO DIALECT RECOGNITION USING
DIALECT-DISCRIMINATING CONTEXT-DEPENDENT PHONETIC MODELS*
Nancy F. Chen, Wade Shen, Joseph P. Campbell
MIT Lincoln Laboratory, Lexington, MA, USA
nancyc@mit.edu, swade@ll.mit.edu, jpc@ll.mit.edu
ABSTRACT
We propose supervised and unsupervised learning algorithms to
extract dialect discriminating phonetic rules and use these rules
to adapt biphones to identify dialects. Despite many challenges
(e.g., sub-dialect issues and no word transcriptions), we discovered dialect discriminating biphones compatible with the linguistic
literature, while outperforming a baseline monophone system by
7.5% (relative). Our proposed dialect discriminating biphone system achieves similar performance to a baseline all-biphone system
despite using 25% fewer biphone models. In addition, our system
complements PRLM (Phone Recognition followed by Language
Modeling), verified by obtaining relative gains of 15-29% when
fused with PRLM. Our work is an encouraging first step towards a
linguistically-informative dialect recognition system, with potential
applications in forensic phonetics, accent training, and language
learning.
Index Terms— dialect recognition, phonetic rules
1. INTRODUCTION
Dialect is an important, yet complicated, aspect of speaker variability. A dialect is the language characteristics of a particular population, where the categorization is primarily regional. Linguistic
studies in dialects are limited to small-scale populations due to time-consuming manual analyses [1]. While these studies provide insight
into how humans identify dialects, large-scale population testing is
required to establish statistical significance.
On the other hand, automatic dialect recognizers are typically
trained on many speakers and efficiently process large amounts
of data. However, current automatic dialect recognizers do not
explicitly model the underlying rules that define dialects, making
it difficult to interpret system models and recognition results in a
linguistically-meaningful manner.
In forensic phonetics, it is important that recognition results of
a dialect recognizer are justifiable on linguistic grounds [2]. The
goal of this work is thus to bridge the gap between the linguistic and
engineering approaches by constructing an informative system that
*This work is sponsored by the Command, Control and Interoperability
Division (CID), which is housed within the Department of Homeland Security's Science and Technology Directorate, under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are
those of the authors and are not necessarily endorsed by the United States
Government. Nancy Chen is also supported by the NIH Ruth L. Kirschstein
National Research Award.
978-1-4244-4296-6/10/$25.00 ©2010 IEEE
discovers dialect-discriminating pronunciation patterns and explicitly uses them to recognize dialects. These dialect-specific patterns
are phonetic rules interpretable by humans. Many of these rules
are phonetic-context dependent; e.g, /w/ in Indian English is often
dropped when in the middle of a word, so “towel” sounds like “tall”.
There are many challenges in designing an informative dialect
recognizer that incorporates phonetic rules. (1) Standard phone recognizers inadequately capture acoustic differences across dialects.
For example, in Indian English there is retroflex /d/, but a typical
phone recognizer may not capture the retroflexness. The retroflex
/d/ might be recognized as a typical /d/, /r/, or some less interpretable phone. (2) State-of-the-art phone recognizers are still far from
perfect when there are no word transcriptions to force-align to. As
a result, transcriptions generated by phone recognizers are noisy estimates of the true acoustic units. Though these limitations make
it challenging to interpret dialect-specific phonetic rules, our work
demonstrates a first step towards an informative dialect recognizer.
Some researchers have investigated phonetic variations between
native and non-native speech (e.g., [3, 4]), focusing on phonetic-context independent rules, and on improving speech recognition. [5]
modeled trajectories of acoustic observations to classify accents, but
these models are difficult to interpret at the phonetic level as the
observations are multi-dimensional. [6] analyzes accents with articulatory models, which complements the phonetic analyses in our work.
Many techniques in dialect identification are ported from language identification [7]. These methods usually fall into two categories: (1) Acoustic modeling which often uses Gaussian mixture
models (GMM) or Hidden Markov Models (HMM) [8, 9, 10, 11];
(2) Language modeling (LM), which captures statistics of phonotactic distributions; e.g., phone recognition followed by language modeling (PRLM) [12]. Since linguistics literature suggests that dialect
differences often occur in certain phonetic contexts [13], we extend
adapted phonetic models (APM) [9] to consider phonetic contexts
and select phonetic contexts that are dialect-discriminating.
The contributions of this paper are multifold: (1) we propose
algorithms to discover dialect-specific context-dependent rules with
and without using transcriptions; (2) despite many challenges in our
dialect recognition task (e.g., sub-dialects in Indian English, lack of
word transcriptions to obtain forced-aligned phones), we discovered
dialect-specific rules compatible with the linguistic literature while
achieving a 7.5% relative gain over a baseline monophone1 system;
1 "Phone" usually refers to "monophone", where a phone's surrounding phones do not affect the identity of the phone of interest; e.g., monophone [t] is always referred to as [t], regardless of its surrounding phones. A biphone is a monophone in the context of another monophone; e.g., biphone [t+a] is defined as the monophone [t] followed by monophone [a].
ICASSP 2010
(3) our proposed system complements PRLM, verified by obtaining relative gains of 14.6-29.3% in fusion experiments. Our work
is an important step towards bridging linguistic and engineering approaches to analyzing dialects, and is potentially useful in applications such as forensic phonetics and accent training tools.
2. METHODS
We exploit the fact that for a given phone recognizer, biphones have
different recognition precision across dialects due to acoustic differences. Modeling the acoustics of these dialect discriminating biphones allows us to distinguish between dialects.
2.1. Supervised Learning
2.1.1. Selecting Dialect Discriminating Biphone Models
We pass a phonetically-labeled speech sequence from dialect d ∈
{d1 , d2 } through a root phone recognizer λ that models monophones
[9].
Let α be a recognizer-hypothesized phone and g(α) be the
ground-truth phone corresponding to α. Biphone αβ is defined as
the phone α that is followed by phone β. We consider biphone
αβ to be dialect discriminating between dialect d1 and d2 if λ’s
recognition precision of biphone αβ is different for the two dialects:
|P(g(α) = α | αβ, λ, d1) − P(g(α) = α | αβ, λ, d2)| ≥ θ0    (1)

for some preset threshold θ0, and if there are sufficiently many occurrences of biphone αβ.
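As an illustration, the selection criterion of Eq (1) can be sketched in a few lines of Python. This is a hedged sketch, not the authors' implementation: the data layout (a dict `hyps` mapping each dialect to (biphone, correct) pairs) and the pooled minimum-occurrence check are assumptions made for the example.

```python
from collections import defaultdict

def select_discriminating_biphones(hyps, theta0, min_count=10):
    """Select biphones whose recognition precision differs across two dialects
    by at least theta0 (Eq (1)), among sufficiently frequent biphones.
    hyps: dialect -> list of (biphone, correct) pairs, where `correct` means
    the recognizer's hypothesis matched the ground-truth phone."""
    # Per-dialect precision of each biphone: biphone -> [n_correct, n_total]
    stats = {d: defaultdict(lambda: [0, 0]) for d in hyps}
    for d, pairs in hyps.items():
        for biphone, correct in pairs:
            stats[d][biphone][1] += 1
            if correct:
                stats[d][biphone][0] += 1
    d1, d2 = sorted(hyps)
    selected = []
    for biphone in set(stats[d1]) & set(stats[d2]):
        c1, n1 = stats[d1][biphone]
        c2, n2 = stats[d2][biphone]
        # Eq (1): precision gap >= theta0, with an occurrence filter.
        if n1 + n2 >= min_count and abs(c1 / n1 - c2 / n2) >= theta0:
            selected.append(biphone)
    return selected
```

For instance, a biphone recognized correctly 90% of the time in one dialect and 0% in the other is selected under θ0 = 0.5, while a biphone with equal precision in both dialects is not.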
2.1.2. Adaptation of Dialect Discriminating Biphone Models

(a) Monophone Adaptation. Since the root recognizer λ was trained on a different corpus from our data, for each monophone α, we first adapt its acoustic model λ(α) to a dialect-specific one [9], where for d ∈ {d1, d2},

λ(α) −adapt→ λd(α)    (2)

(b) Biphone Adaptation. Using standard monophone-to-biphone adaptation methods from standard speech recognition systems, we adapt the dialect-specific monophone model λd(α) to a dialect-specific biphone model λbi_d(αβ) for the dialect discriminating biphones αβ defined by Eq (1), i.e.,

λd(α) −adapt→ λbi_d(αβ)    (3)

The adaptation process in Eq (3) is similar to that in Eq (2), except that the adaptation data are now the observations corresponding to biphone αβ.

2.1.3. Dialect Recognition

Dialect recognition is accomplished via a likelihood ratio test by using the adapted biphone models to compute per-frame log likelihoods. Non-dialect discriminating biphones that occur in the test utterances are modeled as the dialect-specific monophones obtained from Eq (2).

2.2. Unsupervised Learning

We now discuss the case in which the ground-truth phones from the pronunciation dictionary are not given, as they were in Section 2.1.

2.2.1. Adapt All Biphone Models

We use the 1-best hypothesis of the root phone recognizer λ as ground truth and then perform the adaptation of Eq (3) for all biphones αβ ∈ {biphone set} labeled by λ.

2.2.2. Select Dialect Discriminating Biphone Models

For each dialect d ∈ {d1, d2}, we denote by y_{d,αβ} all the acoustic observations of biphones labeled as αβ by the root recognizer λ in the training corpus. Let T_{d,αβ} be the total duration of y_{d,αβ}. Consider the log likelihood ratios

z^{d1}_{αβ} = (1 / T_{d1,αβ}) log [ P(y_{d1,αβ} | λbi_{d1,αβ}) / P(y_{d1,αβ} | λbi_{d2,αβ}) ]    (4)

z^{d2}_{αβ} = (1 / T_{d2,αβ}) log [ P(y_{d2,αβ} | λbi_{d1,αβ}) / P(y_{d2,αβ} | λbi_{d2,αβ}) ]    (5)

The greater the value of z^{d1}_{αβ}, the better the acoustics of biphone αβ are modeled by λbi_{d1,αβ} relative to λbi_{d2,αβ}. Similarly, the smaller the value of z^{d2}_{αβ}, the better the acoustics of biphone αβ are modeled by λbi_{d2,αβ} relative to λbi_{d1,αβ}. If z^{d1}_{αβ} is in general greater than z^{d2}_{αβ}, then biphone αβ is dialect discriminating between dialects d1 and d2. Therefore, we define biphone αβ to be dialect discriminating if

z^{d1}_{αβ} − z^{d2}_{αβ} ≥ θ    (6)

where θ is the decision threshold.

2.2.3. Dialect Recognition

Dialect recognition is accomplished as in Section 2.1.3.
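For concreteness, the selection statistic of Eqs (4)-(6) can be sketched as follows. This is an illustrative sketch under the assumption that per-frame log likelihoods of a biphone's observations under each dialect's adapted model are available; the function names are hypothetical.

```python
def z_score(logliks_own, logliks_other, total_duration):
    """Duration-normalized log likelihood ratio of Eqs (4)-(5):
    z = (1/T) log[ P(y | own-dialect biphone model) / P(y | other-dialect model) ],
    where log P(y | model) is the sum of per-frame log likelihoods."""
    return (sum(logliks_own) - sum(logliks_other)) / total_duration

def is_dialect_discriminating(z_d1, z_d2, theta):
    # Eq (6): biphone is dialect discriminating if z^{d1} - z^{d2} >= theta.
    return z_d1 - z_d2 >= theta
```

The duration normalization makes the two z-scores comparable even when the two dialects contribute different amounts of adaptation data for a given biphone.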
2.2.4. Remarks
This method can still be used when we have ground-truth phones: the root recognizer's 1-best hypotheses in Section 2.2.1 are replaced with the ground-truth phones g from the pronunciation dictionary. The learned biphones are expected to be more interpretable, since there are no longer phone recognition errors.
2.3. Comparison with PRLM
In Eq.(1), the biphone recognition precision difference across dialects is generated by an acoustic difference that is dialect-specific.
We use this dialect-specific information directly in the supervised
method (Section 2.1), and implicitly in the unsupervised method
(Section 2.2). In contrast, PRLM models differences in biphone
occurrence frequency, and is consequently blind to dialect-specific
acoustic differences. Therefore, our approach and PRLM complement each other.
3. EXPERIMENTS
Experimental Setup. We evaluated our system on the NIST LRE07
English dialect task [18]. Our system was trained on 104 hours
of Callfriend, LRE05 test set, OGI’s foreign accented English, and
LDC's Mixer and Fisher corpora [14, 15, 16, 17]. The LRE07 test set includes 80 trials of American English and 160 trials of Indian English. Since the test set is small, it was difficult to conclusively interpret detection error trade-off (DET) curves. Therefore, we used two test sets: (1) Test-LRE07: the official NIST LRE07 30-second test trials. Results are obtained by pooling [18] and a backend classifier [19]. (2) Test-Linc: 1,498 trials of 30-second segments from Mixer and LRE05; 233 trials are Indian English, and 1,265 trials are American English. Neither of these test sets was included in training. Pooling was not used for Test-Linc because it is inappropriate to consider the classes to be independent in closed-set two-way classification; pooling was used in Test-LRE07 for comparison with NIST LRE07.

The baseline APM and PRLM system configurations are the same as in [9]; the root HMM modeled 47 English monophones (3 states/phone, 128 Gaussians/state) using 23 hours of transcribed words from Switchboard-II phase 4 (Cellular).

[Fig. 1. DET curve results of Test-Linc: miss probability vs. false alarm probability (both 1-40%) for the monophone, r-biphone, all-biphone, and proposed-biphone systems, PRLM, and the fusions PRLM + proposed and PRLM + all biphones.]

3.1. Pilot Study: Analyzing Dialect-Specific /r/ Biphones

We discuss a pilot study where we analyze hundreds of labeled /r/ using the supervised method in Section 2.1.
3.1.1. Rule Extraction Conditioned on Phonetic Context
A small data set was phonetically labeled manually to analyze the
dialect-specific context-dependent phonetic rules of /r/2. In the Indian English training set, 200 recognizer-hypothesized /r/ were labeled; 50% of them were incorrectly hypothesized, but the errors showed no bias toward any phone. In the American English training set, 500 hypothesized /r/ were labeled; 80% were correctly hypothesized, and the remaining errors showed no bias toward any phone.
We observe that in certain phonetic contexts, the phone recognition precision of /r/ is dialect-specific, supporting the hypothesis
that some biphones are dialect-specific. For example, if the root recognizer hypothesizes a segment to be /r/ followed by /dx/, then the
recognition accuracy of /r/ is 0% for Indian English, but 90% for
American English.
Dialect discriminating /r/ biphones were selected by setting θ0 in Eq (1) to one standard deviation above the mean of the empirical distribution of |P(g(α) = α | αβ, λ, d1) − P(g(α) = α | αβ, λ, d2)| over all α, β, and by requiring that the biphone occur more often than the average occurrence count over all biphones.
Table 1. System comparison in EER %.

System                         Test-LRE07   Test-Linc
monophone APM                     13.8        10.5
r-biphones                        12.5        10.4
unfiltered biphones               14.7         9.4
filtered biphones (proposed)      15.6         9.7
PRLM                              12.8         9.7
PRLM + unfiltered biphones        10.6         6.6
PRLM + proposed                   10.9         6.9
3.1.2. Results and Discussion
Dialect-specific /r/ biphones extracted in Section 3.1.1 were adapted according to Section 2.1.2; /r/ in other contexts and all other phones were modeled as monophones. The equal error rate (EER) of the r-biphone system improves on monophone APM by 9.1% on Test-LRE07 and 1.5% on Test-Linc (see Table 1). The DET plot of Test-Linc in Fig. 1 shows that the r-biphone system improves the detection error of monophone APM when false alarms are penalized heavily. DET plots of Test-LRE07 are not shown due to space constraints.
Below are possible reasons why the /r/ biphones learned in Section 3.1.1 resulted in inconsistent improvement. (1) The labeled data corresponds to only about one minute of speech, so the rules may not generalize. (2) The labeled Indian English /r/ were mainly from speakers whose first language is Hindi, but the first language of speakers in the test set may be other Indian languages, which may possess different dialect-specific rules. While (2) is more difficult to resolve, as it depends on the documentation of the given corpora, (1) can be resolved by using more data. Thus, we take an unsupervised approach to study more data in the next section.
3.2. Unsupervised Learning of Dialect-Specific Biphones
We now use the unsupervised method in Section 2.2 to learn phonetic rules. We focus our discussion on Test-Linc because it is a larger data set and thus yields more robust results. We will, however, comment on Test-LRE07 when possible.
3.2.1. System Setup and Analysis
2 /r/ was chosen because it showed high acoustic likelihood differences
between the American and Indian English adapted monophone recognizers.
We use θ = 0.1 in Eq (6) on Test-Linc to obtain a selected biphone set. This biphone set is used to evaluate the performance of the dialect discriminating biphone system on Test-LRE07. Similarly, we use θ = 0.1 in Eq (6) on Test-LRE07 to obtain a selected biphone set for Test-Linc; the resulting dialect discriminating biphone system uses 25% fewer biphone models (2,381 to 1,789) than the all-biphone system, while achieving comparable performance (the EER difference is 2.9%).

It turns out that 90% of the dialect discriminating /r/ biphones extracted by the supervised method are retained by the unsupervised method, rather than the 75% that would be expected if the process were completely random. Henceforth, the results we present use this value of θ unless stated otherwise.

For the purpose of analysis, we display in Fig. 2 the EER of Test-Linc as a function of pruned biphones determined by Test-LRE07. The lowest EER was obtained when 18% of biphones were pruned, so our choice of θ is not optimal with respect to EER. Another observation is that with 29% pruned biphone models in the dialect discriminating biphone system, the EER becomes that of monophone APM.

[Fig. 2. EER of Test-Linc as a function of pruned biphones determined by Test-LRE07: EER (%, roughly 9.4-11.2) vs. amount of pruned biphones (10-40%, determined by sweeping θ on Test-LRE07), with reference lines for the monophone APM EER and the proposed biphone system (θ = 0.1).]

On the Test-Linc data set, the proposed dialect discriminating biphone system outperforms the monophone APM system by 7.5% (see Table 1). However, on Test-LRE07, the monophone APM outperforms the dialect discriminating biphone system. We suspect this is due to the small test set, because monophone APM even outperforms the all-biphone case.

Discussion of Learned Rules. Though it is challenging to interpret the learned rules because of phone recognition errors, we find some rules consistent with linguistic knowledge. For example, [dx+r], [dx+axr], and [dx+er] were selected as dialect discriminating biphones. In American English, the intervocalic /t/ is often implemented as a flap consonant [dx] (e.g., "butter"), while this is uncommon in Indian English. Since Indian English has British influence, other biphones such as [ae+s] and [ae+th] might come from words such as "class" and "bath", which are produced differently in American and British accents. [aw+l] was also a selected biphone: some Indian languages have no [w], and thus [w] is often deleted in word-medial positions (e.g., "towel"). Biphones such as [r+dx] learned in Section 3.1.1, though less interpretable, were also selected by the unsupervised method.

3.2.2. Fusion with PRLM

To empirically verify that the proposed system complements PRLM as described in Section 2.3, we fuse our filtered-biphone system with a PRLM system [9], resulting in relative gains of 14.6% and 29.3% on Test-LRE07 and Test-Linc, respectively, with a backend [19]. In addition, the EER performance of the filtered-biphone and unfiltered-biphone systems remains comparable after fusing with PRLM (see Table 1 & Fig. 1).

4. CONCLUSION

We present systematic approaches to discover dialect-specific biphones, and use these biphones to identify dialects. Despite many challenges (e.g., sub-dialect issues and no transcriptions), we discovered dialect-specific biphones compatible with the linguistic literature, while outperforming a baseline monophone system by 7.5% (relative). These results suggest that if word transcriptions are provided, we can potentially retrieve more dialect-specific rules, demonstrating that our work is an encouraging first step towards a linguistically-informative dialect recognizer. In addition, our proposed system obtains relative gains of 14.6-29.3% when fused with PRLM, indicating that it complements phonotactic approaches. Applications of this work include forensic phonetics and language learning.

ACKNOWLEDGMENTS

The authors thank Jim Glass, Reva Schwartz, and Ken Stevens for their input.

5. REFERENCES
[1] Labov, W., et al., "The Atlas of North American English: Phonetics, Phonology, and Sound Change," Mouton de Gruyter, Berlin, 2006.
[2] Rose, P., "Forensic Speaker Identification," Taylor and Francis, 2002.
[3] Fung, P., Liu, Y., “Effects and Modeling of Phonetic and Acoustic Confusions in Accented Speech,” JASA, 118(5):3279-3293,
2005.
[4] Livescu, K., Glass, J., “Lexical Modeling of Non-Native speech
for automatic speech recognition”, ICASSP, 2000.
[5] Angkititrakul, P., Hansen, J., “Advances in phone-based modeling for automatic accent classification,” IEEE TASLP, vol. 14,
pp. 634-646, 2006.
[6] Sangwan, A. and Hansen, J., “On the use of phonological features for automatic accent analysis,” Interspeech 2009.
[7] Zissman, M., “Comparison of Four Approaches to Automatic
Language Identification of Telephone Speech,” IEEE TSAP.,
4(1):31-44, 1996.
[8] Choueiter, G., Zweig, G., Nguyen, P. “An Empirical Study of
Automatic Accent Classification,” ICASSP, 2008.
[9] Shen, W., Chen, N., Reynolds, D., “Dialect Recognition using
Adapted Phonetic Models,” Interspeech, 2008.
[10] Torres-Carrasquillo, P., Gleason, T., Reynolds, D., “Dialect
Identification using Gaussian Mixture Models,” Odyssey 2004.
[11] Huang, R., Hansen, J. “Unsupervised Discriminative Training with Application to Dialect Classification” IEEE TASLP
15(8):2444-2453, 2007.
[12] Zissman, M., Gleason, T., Rekart, D., Losiewicz, B., “Automatic Dialect Identification of Extemporaneous Conversational
Latin American Spanish Speech,” ICASSP 1995.
[13] Wells, J., “Accents of English: Beyond the British Isles,” Cambridge University Press, 1982.
[14] Callfriend American English corpus: http://www.ldc.upenn.edu/Catalog/docs/LDC96S46
[15] CSLU foreign-accented English corpus: http://www.cslu.ogi.edu/corpora/fae/
[16] Cieri, C. et al., “The Fisher Corpus: a Resource for the Next
Generations of Speech-to-Text,” LREC, 2004.
[17] Cieri, C. et al., “The Mixer Corpus of Multilingual, Multichannel Speaker Recognition Data,” LREC, 2004.
[18] NIST Language Recognition Evaluation 2007: http://www.nist.gov/speech/tests/lang/2007
[19] Reynolds, D. et al., “Automatic Language Recognition via
Spectral and Token Based Approaches,” in J. Benesty et al.
(eds.), Springer Handbook of Speech Processing, Springer,
2007.