Knowledge of Language Origin Improves Pronunciation Accuracy
Ariadna Font Llitjos
April 13, 2001
Advisor: Alan W Black

Motivation
It is impossible to have a lexicon with complete coverage, and a high proportion of unknown words are proper names.
In an experiment by [Black, Lenzo and Pagel, 1998] on the first section of the WSJ Penn Treebank (about 40,000 words), 4.6% of the words (1775) were out of vocabulary (using OALD), and 76.6% of those were proper names.

Motivation cont.
We need an automatic way of learning an acceptable pronunciation for OOV words, most of which are proper names.
General approach: LTS rules (CART)
Specifically, add language probability information

Data and limits
- 56,000 proper names from the CMUDICT lexicon with stress [originally from Bell Labs directory listings, ~20 years ago]; 90% training set & 10% test set
- We only looked at the educated native American English pronunciation of proper names: e.g. for ‘Van Gogh’, we don’t want our system to say /F AE1 N G O K/ or /F AE1 N G O G/, which some people may claim is the correct pronunciation, but rather the educated American pronunciation: /V AE1 N . G OW1/.

Baseline Technique
Decision trees to predict phones based on letters and their context (n-grams).
In English, a letter maps to epsilon, a phone, or occasionally two phones:
(a) Monongahela: m ah n oa1 ng g ah hh ey1 l ax
(b) Pittsburgh: p ih1 t s b er g
(c) exchange: ih k-s ch ey1 n jh
Allowables (45 -> 101 phones) and alignments (stress & epsilon misplacements affect accuracy)

Origin Class info
What does origin class mean?
- geographic?
- etymologic? [Church, 2000]
- language (our 1st approach)
- data driven (what we really want, current work)

LLM for 26 languages
- European Corpus IMC I: English, French, German, Spanish, Croatian, Czech, Danish, Dutch, Estonian, Hebrew, Italian, Malaysian, Norwegian, Portuguese, Serbian, Slovenian, Swedish, Turkish
- using the Corpusbuilder + manually: Catalan, Chinese, Japanese, Korean, Polish, Thai, Tamil and other Indian languages (except for Tamil).

Language Identifier
An implementation of a variation of the algorithm presented in Cavnar, W.B. and Trenkle, J.M. N-Gram-Based Text Categorization. In Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994.

Language Identifier cont.
The language identifier creates an LLM on the fly for the input word (or document). For every trigram in the input, it calculates the probability of the trigram belonging to each language by multiplying the input's trigram frequencies by the relative frequencies of those trigrams in each language's LLM.

LI example
./classify.pl -t "Ying Zhang"
chinese-pn: 0.730594870150084
english.train: 0.0525988955766553
german-pn: 0.0506847882275029
british-pn: 0.0378543572677309
german.train: 0.0303455616225699
tamil-pn: 0.029581372574322
french-pn: 0.0201655107720744
spanish-pn: 0.0185146818045872
catalan-pn: 0.0162318631058251
japanese-pn: 0.00851225092810786
french.train: 0.002861385664355
spanish.train: 0.00205446230618505

Indirect use of the Language Identifier
Instead of building trees explicitly for each language (data sparseness problem), we use the results of the language identification process as features within the CART build process, so that those features affect tree building only when their information is relevant.
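
Sketch: trigram language scoring
A minimal sketch of the trigram scoring described on the Language Identifier slides, written in Python rather than the Perl of classify.pl. The toy name lists, the function names, the '#' word-boundary padding, and the normalization of scores to sum to 1 are all assumptions made for illustration; the real classify.pl and its LLMs may differ in padding, smoothing and scoring details.

```python
from collections import Counter


def trigram_profile(text):
    """Relative frequencies of character trigrams; '#' marks word boundaries (assumed)."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"#{word}#"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    total = sum(counts.values())
    return {tri: n / total for tri, n in counts.items()} if total else {}


def classify(text, language_profiles):
    """Score the input against every language profile, highest score first.

    Raw score = sum over input trigrams of
    (relative frequency in the input) * (relative frequency in the language).
    Scores are then normalized to sum to 1 (an assumption of this sketch).
    """
    query = trigram_profile(text)
    raw = {
        lang: sum(weight * profile.get(tri, 0.0) for tri, weight in query.items())
        for lang, profile in language_profiles.items()
    }
    total = sum(raw.values())
    if total == 0:
        return [(lang, 0.0) for lang in raw]
    return sorted(
        ((lang, score / total) for lang, score in raw.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical toy training data; the real LLMs are built from much larger corpora.
TOY_NAMES = {
    "chinese-pn": "zhang wang ying li chen yang zhao huang",
    "german-pn": "schmidt mueller schneider fischer weber",
    "spanish-pn": "garcia martinez rodriguez lopez hernandez",
}
PROFILES = {lang: trigram_profile(names) for lang, names in TOY_NAMES.items()}

ranking = classify("Ying Zhang", PROFILES)
for lang, score in ranking:
    print(f"{lang}: {score}")
```

With real LLMs the ranked list would play the role of the classify.pl output above, and its top entries would feed the CART features.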
Features for our pronunciation model
We decided to add the following to the n-gram features:
- most probable language, with its probability
- 2nd most likely language, with its probability
- difference between the 2 highest probabilities
(zysk
 ((best-lang slovenian.train)
  (higher-prob 0.18471)
  (2nd-best-lang czech.train)
  (2nd-higher-prob 0.18428)
  (prob-difference 0.00043)))

CART example
((a
 ((n.n.n.name is 0)
  ((n.name is #)
   ((p.name is e)
    ((p.p.p.name is #)
     ((_epsilon_))
     ((p.p.p.name is c) ((_epsilon_)) ((ax))))
    ((ax)))
   ((n.name is y)
    ((p.p.p.name is #)
     ((ey1))
     ((p.p.p.name is 0)
      ((ey1))
      ((p.name is w)
       ((p.p.p.name is e)
        ((ey1))
        ((p.p.p.name is t)
         ((ey))
         ((p.p.p.name is n)
          ((2nd-best-lang is "english.train") ((ey)) ((ey1)))
          ((2nd-best-lang is "czech.train")
           ((p.p.p.name is d) ((ey1)) ((ey)))
           ((ey1))))))
       ((p.name is d)
        ((2nd-best-lang is "english.train")
         ((p.p.p.name is l) ((ey)) ((ey1)))
         ((ey1)))
        ((p.p.p.name is c)
         ((ey1))
         ((2nd-best-lang is "malaysian.train")
          ((p.p.p.name is m) ((ey1)) ((_epsilon_)))
          ((2nd-best-lang is "czech.train") ((_epsilon_)) ((ey)))))))))

Results
Lexicons    Letters   Words
PN-base-5   89.02%    54.08%
PN-lang-5   91.23%    61.72%
PN-base-8   90.29%    52.88%
PN-lang-8   90.63%    59.77%
CMUDICT     91.99%    57.80%
OALD        95.80%    74.56%

Rho’s example
Cepstral’s talking head
./oscars-example

User Studies
From the names that both PN-base-8 and PN-lang-8 got “wrong” (did not exactly match the CMUDICT pronunciation in the test set), I selected the ones for which the two models assigned different pronunciations (112), and from those I selected 20 at random to run perceived-accuracy user studies.
Overall, the perceived accuracy of the PN-lang-8 model was 17 percentage points higher (PN-lang-8: 46%, PN-base-8: 29%, no preference: 25%).
… or about a 60% relative improvement (17/29 ≈ 59%)

Upper bound
UB is determined by:
- how noisy the data is
- how much language origin info can really help us in this task [hard to estimate without reliably labeled data]
…
- what about adding prior probabilities?
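
Sketch: deriving the language features
A small sketch of how the five language features from the Features slide (best-lang, higher-prob, 2nd-best-lang, 2nd-higher-prob, prob-difference) could be derived from the language identifier's scores. The function name and the input format (a dict from LLM name to score) are assumptions; the two scores are the ones shown for 'zysk', and the rest of that ranking is omitted because the slides do not give it.

```python
def language_features(scores):
    """Build the five language features from a {language: score} mapping.

    Assumes at least two languages were scored.
    """
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    (best, best_prob), (second, second_prob) = ranked[0], ranked[1]
    return {
        "best-lang": best,
        "higher-prob": best_prob,
        "2nd-best-lang": second,
        "2nd-higher-prob": second_prob,
        "prob-difference": best_prob - second_prob,
    }


# Scores for 'zysk' as shown on the Features slide (top two languages only).
zysk = language_features({"czech.train": 0.18428, "slovenian.train": 0.18471})
for name, value in zysk.items():
    print(name, value)
```

In the CART build these values would simply be appended to each letter's n-gram context, so the tree can ask questions such as (2nd-best-lang is "czech.train") only where they help.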
Priors
For each language, we could have a prior probability that tells us how likely it is to find a name in that language, independently of the name.
If our model were trained on newswire data instead of directory listings, it would be relatively easy to determine such priors.
E.g.: “Yesterday in Barcelona, the mayor Joan Clos inaugurated the Forum of Cultures…”
P(Catalan) = 0.8
P(Spanish) = 0.15
P(all other languages) ~ 0

What I’m working on now
Unsupervised clustering of proper names, taking the pronunciation into account.
Traditionally, people working on grapheme-to-phoneme conversion have looked only at the written words, not at the actual pronunciation.

Second approach
- Convert a word into a bunch of features of the form l1 l2 l3 ph2, i.e. a letter in context (trigram) and the phone it is aligned to
- Bottom-up unsupervised clustering
Criterion: merge two clusters unless there is a clash

Defining clash
Two clusters will be merged if their contexts (trigrams) are different or if every context they share is aligned to the same phone in both clusters.

Example

References
- Black, A., Lenzo, K. and Pagel, V. Issues in Building General Letter to Sound Rules. 3rd ESCA Speech Synthesis Workshop, pp. 77-80, Jenolan Caves, Australia, 1998.
- CMUDICT. Carnegie Mellon Pronunciation Dictionary. 1998. http://www.speech.cs.cmu.edu/cgi-bin/cmudict
- Church, K. Stress Assignment in Letter to Sound Rules for Speech Synthesis (Technical Memorandum). AT&T Labs – Research. November 27, 2000.
- Chotimongkol, A. and Black, A. Statistically trained orthographic to sound models for Thai. Beijing, October 2000.
- Tomokiyo, T. Applying Maximum Entropy to English Grapheme-to-Phoneme Conversion. LTI, CMU. Project for 11744, unpublished. May 9, 2000.
- Ghani, R., Jones, R. and Mladenic, D. Building Minority Language Corpora by Learning to Generate Web Search Queries. Technical Report CMU-CALD-01-100, 2001. http://www.cs.cmu.edu/~TextLearning/corpusbuilder/

Questions & Ideas
…

Thanks