Knowledge of Language Origin Improves Pronunciation Accuracy

Ariadna Font Llitjos
April 13, 2001
Advisor: Alan W Black
Motivation

It is impossible to have a lexicon with complete coverage, and a high
proportion of unknown words are proper names:

In an experiment by [Black, Lenzo and Pagel, 1998], processing the first
section of the WSJ Penn Treebank (about 40,000 words) showed that 4.6%
(1,775 words) were out-of-vocabulary with respect to OALD, and 76.6% of
those were proper names.
Motivation cont.

We need an automatic way of learning an acceptable pronunciation for OOV
words, most of which are proper names.

General approach: LTS rules (CART)

Specifically, add language probability information
Data and limits

- 56,000 proper names from the CMUDICT lexicon, with stress [originally
  from Bell Labs directory listings, ~20 years ago]
- 90% training set & 10% test set
- We only looked at the educated native American English pronunciation of
  proper names: e.g. for 'Van Gogh', we do not want our system to say
  /F AE1 N G O K/ or /F AE1 N G O G/, which some people may claim is the
  correct pronunciation, but rather the educated American one:
  /V AE1 N . G OW1/.
Baseline Technique

Decision trees to predict phones based on letters and their context
(n-grams). In English, letters map to epsilon, a phone, or occasionally
two phones:

(a) Monongahela -> m ah n oa1 ng g ah hh ey1 l ax
(b) Pittsburgh  -> p ih1 t s b er g
(c) exchange    -> ih k-s ch ey1 n jh

Allowables (45 -> 101 phones) and alignments (stress & epsilon
misplacements affect accuracy); see the sketch below.
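
A minimal sketch of how one aligned entry could feed such a tree as
training samples; the helper name, window width, and padding symbol are
illustrative choices, not Festival's actual feature extraction:

# Sketch: turning an aligned lexicon entry into CART training samples.
# Each letter predicts one "allowable": _epsilon_, a single phone, or a
# fused pair such as "k-s".
def context_windows(letters, phones, width=3):
    """Yield (letter-context, target-phone) pairs; '#' pads the edges."""
    assert len(letters) == len(phones)      # one allowable per letter
    padded = ["#"] * width + list(letters) + ["#"] * width
    for i, target in enumerate(phones):
        yield padded[i : i + 2 * width + 1], target

# 'exchange' aligned letter by letter, as in example (c) above:
letters = "exchange"
phones = ["ih", "k-s", "ch", "_epsilon_", "ey1", "n", "jh", "_epsilon_"]
for window, phone in context_windows(letters, phones):
    print(window, "->", phone)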
Origin Class info

What does origin class mean?
- geographic?
- etymologic? [Church, 2000]
- language (our 1st approach)
- data driven (what we really want; current work)

LLM for 26 languages

- European Corpus Initiative Multilingual Corpus I (ECI/MCI):
  English, French, German, Spanish, Croatian, Czech, Danish, Dutch,
  Estonian, Hebrew, Italian, Malaysian, Norwegian, Portuguese, Serbian,
  Slovenian, Swedish, Turkish
- using CorpusBuilder + manual collection:
  Catalan, Chinese, Japanese, Korean, Polish, Thai, Tamil and other
  Indian languages
Language Identifier

An implementation of a variation of the algorithm presented in Cavnar,
W.B. and Trenkle, J.M., N-Gram-Based Text Categorization, in Proceedings
of the 3rd Annual Symposium on Document Analysis and Information
Retrieval, 1994.
Language Identifier cont.

The language identifier builds an LLM on the fly for the input word (or
document) and, for every trigram in the input, multiplies the running
score for each language by that trigram's relative frequency in the
language's LLM; the result is the probability of the input belonging to
each language.
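
A minimal sketch of that scoring scheme in Python; the model dictionaries,
smoothing floor, and padding are invented for illustration (the real
classify.pl and its LLM files are not reproduced here):

# Sketch: trigram-based language scoring over relative-frequency models.
def trigrams(text):
    text = "_" + text.lower().replace(" ", "_") + "_"   # pad word edges
    return [text[i:i + 3] for i in range(len(text) - 2)]

def score(word, models):
    """models: language -> {trigram: relative frequency}."""
    scores = {}
    for lang, freqs in models.items():
        p = 1.0
        for tri in trigrams(word):
            p *= freqs.get(tri, 1e-6)   # small floor for unseen trigrams
        scores[lang] = p
    total = sum(scores.values())
    return {lang: p / total for lang, p in scores.items()}  # normalize

# Toy models (made-up frequencies) to show the call shape:
models = {"chinese-pn":    {"yin": 0.03, "ing": 0.04, "zha": 0.05},
          "english.train": {"ing": 0.05, "ang": 0.01}}
print(score("Ying Zhang", models))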
LI example

./classify.pl -t "Ying Zhang"

chinese-pn:     0.730594870150084
english.train:  0.0525988955766553
german-pn:      0.0506847882275029
british-pn:     0.0378543572677309
german.train:   0.0303455616225699
tamil-pn:       0.029581372574322
french-pn:      0.0201655107720744
spanish-pn:     0.0185146818045872
catalan-pn:     0.0162318631058251
japanese-pn:    0.00851225092810786
french.train:   0.002861385664355
spanish.train:  0.00205446230618505
Indirect use of the Language Identifier

Instead of building trees explicitly for each language (a data sparseness
problem), we use the results of the language identification process as
features within the CART build process, allowing those features to affect
the tree building only when their information is relevant.
Features for our pronunciation model

We decided to add the following to the n-gram features:
- most probable language, with its probability
- 2nd most likely language, with its probability
- difference between the 2 highest probabilities

(zysk ((best-lang slovenian.train) (higher-prob 0.18471)
       (2nd-best-lang czech.train) (2nd-higher-prob 0.18428)
       (prob-difference 0.00043)))
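
A sketch of how such a feature tuple could be derived from the
identifier's normalized scores; the feature names follow the zysk example
above, the third score below is invented, and score() is the function from
the earlier language-identifier sketch:

# Sketch: the three extra CART features from ranked language scores.
def language_features(scores):
    """scores: normalized {language: probability}, e.g. from score()."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, p1), (second, p2) = ranked[:2]
    return {"best-lang": best, "higher-prob": round(p1, 5),
            "2nd-best-lang": second, "2nd-higher-prob": round(p2, 5),
            "prob-difference": round(p1 - p2, 5)}

print(language_features({"slovenian.train": 0.18471,
                         "czech.train": 0.18428,
                         "english.train": 0.09}))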
CART example

((a
  ((n.n.n.name is 0)
   ((n.name is #)
    ((p.name is e)
     ((p.p.p.name is #)
      ((_epsilon_))
      ((p.p.p.name is c) ((_epsilon_)) ((ax))))
     ((ax)))
    ((n.name is y)
     ((p.p.p.name is #)
      ((ey1))
      ((p.p.p.name is 0)
       ((ey1))
       ((p.name is w)
        ((p.p.p.name is e)
         ((ey1))
         ((p.p.p.name is t)
          ((ey))
          ((p.p.p.name is n)
           ((2nd-best-lang is "english.train") ((ey)) ((ey1)))
           ((2nd-best-lang is "czech.train")
            ((p.p.p.name is d) ((ey1)) ((ey)))
            ((ey1))))))
        ((p.name is d)
         ((2nd-best-lang is "english.train")
          ((p.p.p.name is l) ((ey)) ((ey1)))
          ((ey1)))
         ((p.p.p.name is c)
          ((ey1))
          ((2nd-best-lang is "malaysian.train")
           ((p.p.p.name is m) ((ey1)) ((_epsilon_)))
           ((2nd-best-lang is "czech.train") ((_epsilon_)) ((ey)))))))))
Results

Lexicon      Letters   Words
PN-base-5    89.02%    54.08%
PN-lang-5    91.23%    61.72%
PN-base-8    90.29%    52.88%
PN-lang-8    90.63%    59.77%
CMUDICT      91.99%    57.80%
OALD         95.80%    74.56%
Rho’s example
Cepstral’s talking head
 ./oscars-example

User Studies

From the names that both PN-base-8 and PN-lang-8 got "wrong" (did not
exactly match the CMUDICT pronunciation in the test set), I selected the
ones for which the two models assigned different pronunciations (112),
and from those I selected 20 at random to run perceived-accuracy user
studies.

Overall, the perceived accuracy of the PN-lang-8 model was 17% higher
(PN-lang-8: 46%, PN-base-8: 29%, no preference: 25%)
... or a 60% relative improvement.
Upper bound

The upper bound is determined by:
- how noisy the data is
- how much language origin info can really help us in this task [hard to
  estimate without reliably labeled data]
- ...
- what about adding prior probabilities?

Priors

For each language, we could have a prior probability that tells us how
likely it is to find a name in that language, independently of the name
itself. If our model were trained on newswire data instead of directory
listings, it would be relatively easy to determine such priors. E.g.:

"Yesterday in Barcelona, the mayor Joan Clos inaugurated the Forum of
Cultures...",
P(Catalan) = 0.8
P(Spanish) = 0.15
P(all other languages) ~ 0
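
A sketch of how such priors could be folded into the identifier's scores,
Bayes-style; the language keys and all numbers below are invented:

# Sketch: posterior ~ P(word | lang) * P(lang), renormalized.
def posterior(scores, priors):
    """scores: P(word | lang) from the identifier; priors: P(lang)."""
    post = {lang: p * priors.get(lang, 1e-4) for lang, p in scores.items()}
    total = sum(post.values())
    return {lang: p / total for lang, p in post.items()}

# A Barcelona dateline might justify priors like those on the slide:
priors = {"catalan-pn": 0.8, "spanish-pn": 0.15, "english.train": 0.0001}
scores = {"catalan-pn": 0.30, "spanish-pn": 0.35, "english.train": 0.35}
print(posterior(scores, priors))   # context pulls the name toward Catalan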
What I'm working on now

Unsupervised clustering of proper names, taking the pronunciation into
account.

Traditionally, people working on grapheme-to-phoneme conversion only
looked at the written words, not at the actual pronunciations.
Second approach

- Convert a word into a set of features of the form: l1 l2 l3 ph2,
  i.e. a letter in context (a trigram) and the phone it is aligned to
  (see the sketch after this list)
- Bottom-up unsupervised clustering
  Criterion: merge two clusters unless there is a clash
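
A minimal sketch of that conversion, assuming letters and phones are
already aligned one-to-one; the '#' padding is an illustrative choice:

# Sketch: an aligned name as {(l1, l2, l3): ph2} features, where ph2 is
# the phone aligned to the middle letter l2.
def trigram_phone_features(letters, phones):
    padded = ["#"] + list(letters) + ["#"]
    return {(padded[i], padded[i + 1], padded[i + 2]): phones[i]
            for i in range(len(phones))}

print(trigram_phone_features("ana", ["ae1", "n", "ax"]))
# {('#','a','n'): 'ae1', ('a','n','a'): 'n', ('n','a','#'): 'ax'}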
Defining clash

Two clusters are merged if their contexts (trigrams) are disjoint, or if
every common context is aligned to the same phone in both clusters; a
clash is a shared trigram aligned to different phones (see the sketch
below).
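
A sketch of the clash test and a greedy bottom-up merge pass, with
clusters represented as the {trigram: phone} dicts built by
trigram_phone_features() above (a simplification, not the actual
clustering code):

# Sketch: merge any pair of clusters whose shared trigrams all agree.
def clash(c1, c2):
    """A clash is a shared trigram aligned to different phones."""
    return any(c1[t] != c2[t] for t in c1.keys() & c2.keys())

def merge_clusters(clusters):
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if not clash(clusters[i], clusters[j]):
                    clusters[i].update(clusters.pop(j))  # absorb cluster j
                    merged = True
                    break
            if merged:
                break
    return clusters

names = [("ana", ["ae1", "n", "ax"]),
         ("anna", ["ae1", "n", "_epsilon_", "ax"])]
clusters = [trigram_phone_features(l, p) for l, p in names]
print(len(merge_clusters(clusters)))   # -> 1: no shared trigram clashes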

Example
References

- Black, A., Lenzo, K. and Pagel, V. Issues in Building General Letter to
  Sound Rules. 3rd ESCA Speech Synthesis Workshop, pp. 77-80, Jenolan
  Caves, Australia, 1998.
- CMUDICT. Carnegie Mellon Pronouncing Dictionary. 1998.
  http://www.speech.cs.cmu.edu/cgi-bin/cmudict
- Church, K. Stress Assignment in Letter to Sound Rules for Speech
  Synthesis (Technical Memorandum). AT&T Labs - Research. November 27,
  2000.
- Chotimongkol, A. and Black, A. Statistically Trained Orthographic to
  Sound Models for Thai. ICSLP, Beijing, October 2000.
- Tomokiyo, T. Applying Maximum Entropy to English Grapheme-to-Phoneme
  Conversion. LTI, CMU. Project for 11-744, unpublished. May 9, 2000.
- Ghani, R., Jones, R. and Mladenic, D. Building Minority Language Corpora
  by Learning to Generate Web Search Queries. Technical Report
  CMU-CALD-01-100, 2001.
  http://www.cs.cmu.edu/~TextLearning/corpusbuilder/
Questions & Ideas

…
… Thanks