Data-driven approach to rapid prototyping Xhosa speech synthesis

Albert Visagie, Justus Roux
Centre for Language and Speech Technology
Stellenbosch University, South Africa

Introduction
• Japan-South African Intergovernmental Science and Technology Cooperation Programme.
• Goals:
  – Understand what is needed from a linguistic and technology standpoint.
  – Build a text-analysis front-end.
  – Establish an experimental platform.

Outline
• Xhosa:
  – orthography,
  – phonetics,
  – tone.
• Approach:
  – text analysis,
  – HTS.

Xhosa
• Xhosa is spoken in South Africa by about 8 million people.
• It is one of the official languages of South Africa.
• The writing system is relatively young, and based on English letters.
• Many dialects.
• Clicks were borrowed from Khoisan.

Xhosa: Orthography
Agglutinative language.
Nouns:
  – 15 classes (including plural & singular).
  – Nouns are affixed for the diminutive.
Verbs:
  – Verbs are affixed according to subject, tense, negation, etc.
Examples:
  teach: -fund-
  preacher (teacher): umfundisi = u + m(u) + fund + is + i
  small preacher: umfundisana = u + m(u) + fund + is + ana
  He/she will teach them: uzakubafundisa = u + za + ku + ba + fund + is + a

Xhosa: Phonetics
Consonants:
• Implosive /b/.
• Ejective and aspirated versions of stops.
• 15 clicks.
Vowels:
• Five basic vowels, including long versions.

Xhosa: Tone
• According to the literature, Xhosa is a tone language.
• High, Low, and Falling tones.
• A recent dictionary marks tone for root morphemes; rules can be constructed to predict tone movement under morphological composition.
• Recent work:
  – Downing, Roux: argue for accent.
  – Kuun: statistical experiment suggests a highly regular structure.
• The observed regularity in pitch rises and duration increases gives a simple method to use in a first prototype.

Approach
Focus on language-dependent components:
  – Build the text analyser,
  – use an existing synthesiser.
Choice: HTS 2.0
  – Model-driven, trainable synthesiser.
  – Contains language-independent F0 and duration models.
  – Makes good use of the synthesis database by predicting spectrum, F0 and segment duration separately.

HTS

HTS: Symbolic Features
Each segment of audio (HMM state) is labelled according to its linguistic context.
Examples:
• Phonetic context: labels of preceding and following phones.
• Parts-of-speech.
• Stress or canonical tone.
• Counting.

Text Analyser Components
Components:
  – Orthographic to phonetic
  – Morphological analysis
  – Parts-of-speech
  – Canonical tone marks

Orthographic to Phonetic
• The orthography is very young, and highly consistent with the pronunciation.
• Hand-written letter-to-sound rewrite rules.
• Lexicon for loan words.

Morphology
• Specially bootstrapped from a Zulu version for this project.
• Requires a lexicon of root morphemes.
• Works with isolated words.
• Ambiguous!
• Ideal: root morpheme boundaries, affix types, POS tagger for disambiguation.
• Implemented: none.

Parts-of-Speech
• Morphological analysis.
• Ideal: POS tagger.
• Implemented: exhaustive lists of closed sets (pronouns, conjunctions, prepositions, etc.).

Tone
• A printed dictionary with canonical tone markings for root morphemes is available.
• Rules can be constructed to determine the movement of at least High tones under morphological composition.
• Highly regular structure: the 3rd-from-last syllable starts a high pitch excursion, and the 2nd-from-last syllable is lengthened.
• Ideal: exhaustive specification of set tones.
• Implemented: word-level syllable counts (3-1, 2-2, 1-3).

Tests
• Basic intelligibility test: listeners were asked to transcribe what they heard.
  – Incomplete phrases.
  – Two versions of the question set, and natural utterances (recoded).
  – Mother-tongue and second-language speakers.
• Impressions:
  – “He’s from the townships.”
  – “That’s perfect, there’s nothing wrong with that.”
  – Also frowns and repeats.

Next Steps
• Comprehension test?
• Impressions.
• Baseline comparative/preference test.
• Improvements:
  – Question phrases.
  – Information from morphological analysis.
  – Canonical tone markings.
• Zulu

Conclusion
• The system worked very well, considering the bare minimum of knowledge currently incorporated.
• The data-driven approach with HTS is well suited to bootstrapping a new language.
• We now have an experimental platform.

Demos
“Ubangele amadoda amaninzi kule lali,”
  – Natural:
  – Synthesised:
“waqalisa ukunqwenela ukuba nomzi.”
  – Natural:
  – Synthesised:
Click song:
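
Sketch: word-level syllable counts

The word-level syllable counts listed under Tone (3-1, 2-2, 1-3) can be illustrated with a minimal Python sketch. It is not the project's code. The reading of each pair as (syllables remaining to the end of the word, position from the start of the word), the function names `syllabify` and `syllable_position_labels`, and the vowel-based syllable split are all assumptions made for illustration. The split closes a syllable at every vowel letter, so it ignores syllabic nasals and would split a doubled vowel letter into two syllables.

```python
# Hypothetical sketch of word-level syllable position labels.
# Assumption: each label is "<count to end of word>-<count from start of word>".

VOWELS = set("aeiou")


def syllabify(word):
    """Split a word into open syllables, closing a syllable at each vowel letter.

    Simplification (assumption): syllabic nasals and long vowels are not handled.
    """
    syllables = []
    current = ""
    for ch in word.lower():
        current += ch
        if ch in VOWELS:
            syllables.append(current)
            current = ""
    if current:
        # Trailing consonants with no vowel: attach them to the last syllable,
        # or keep them as a single syllable if the word has no vowel at all.
        if syllables:
            syllables[-1] += current
        else:
            syllables.append(current)
    return syllables


def syllable_position_labels(word):
    """Pair every syllable with its 'to-end'-'from-start' count label."""
    syllables = syllabify(word)
    total = len(syllables)
    return [(syl, f"{total - i}-{i + 1}") for i, syl in enumerate(syllables)]


if __name__ == "__main__":
    print(syllable_position_labels("ukuba"))
    # → [('u', '3-1'), ('ku', '2-2'), ('ba', '1-3')]
```

Under this reading, a count-to-end of 3 or 2 marks the syllables where the pitch excursion and the lengthening were observed, so a trainable model given such labels has the information it needs to pick up that regularity.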