Using the Segmentation Corpus to define an inventory of concatenative units for Cantonese speech synthesis Wai Yi Peggy WONG Chris BREW Mary E. BECKMAN Linguistics Dept., Ohio State University 222 Oxley Hall, 1712 Neil Ave. Columbus, OH, 43210-1298 USA {pwong, cbrew, mbeckman}@ling.osu.edu Abstract The problem of word segmentation affects all aspects of Chinese language processing, including the development of text-to-speech synthesis systems. In synthesizing a Hong Kong Cantonese text, for example, words must be identified in order to model fusion of coda [p] with initial [h], and other similar effects that differentiate word-internal syllable boundaries from syllable edges that begin or end words. Accurate segmentation is necessary also for developing any list of words large enough to identify the wordinternal cross-syllable sequences that must be recorded to model such effects using concatenated synthesis units. This paper describes our use of the Segmentation Corpus to constrain such units. Introduction What are the best units to use in building a fixed inventory of concatenative units for an unlimited vocabulary text-to-speech (TTS) synthesis system for a language? Given a particular choice of unit type, how large is the inventory of such units for the language, and what is the best way to design materials to cover all or most of these units in one recording session? Are there effects such as prosodically conditioned allophony that cannot be modeled well by the basic unit type? These are questions that can only be answered language by language, and answering them for Cantonese1 poses several interesting challenges. 1 The term “Cantonese” has at least three meanings. Shui-duan CHAN Chinese Language Centre, Hong Kong Polytechnic University Yuk Choi Road, Hung Hom, Kowloon, HONG KONG chsdchan@polyu.edu.hk One major challenge involves the definition of the “word” in Cantonese. As in other varieties of Chinese, morphemes in Cantonese are typically monosyllabic and syllable structure is extremely simple, which might suggest the demi-syllable or even the syllable (Chu & Ching, 1997) as an obvious basic unit. At the same time, however, there are segmental “sandhi” effects that conjoin syllables within a word. For example, when the morpheme 集 zaap6 2 stands as a word alone (meaning ‘to collect’), the [p] is a glottalized and unreleased coda stop, but when the morpheme occurs in the longer word 集合 zaap6hap6 (‘to assemble’), the coda [p] often resyllabifies and fuses with the following [h] to make an initial aspirated stop. Accurate word segmentation at the text analysis level is essential for identifying the domain of such sandhi effects in any full-fledged TTS system, whatever method is used for generating the waveform from the specified pronunciation of the word. A further challenge is to find a way to capture such sandhi effects in systems that use It can refer to the regional standard that has developed in Canton City over the last millennium, to the larger group of dialects centered around Canton City, or to the new urban standard that has emerged in Hong Kong over the 150 years since the port city was founded. We will be using the unmodified term mostly to refer to the third sense, and context should make it clear where we mean one of the other two senses. 2 We use the Jyutping romanization developed by the Linguistics Society of Hong Kong in 1993. See http://www.cpct92.cityu.edu.hk/lshk. concatenative methods for waveform generation. So long as the word is defined, these effects might be encoded in cross-syllabic diphones, which can be added as an exceptional type of unit to augment the inventory of syllables or demi-syllables. Law & Lee (2000) have even proposed replacing the syllable entirely as the basic unit type, using instead a necessarily crosssyllabic unit, the “final-initial combination”3, as the basic unit, augmented with word-initial onsets and word-final rhymes for the transitions out of and into a pause. Whether the cross-syllabic unit is the basic unit or an exceptional unit that augments an inventory of some other basic unit type, however, a comprehensive inventory of units is required. Such an inventory cannot be defined without having recourse to a reasonably large list of polysyllabic words. Since lexicon development for Hong Kong Cantonese lags considerably behind analogous efforts for Mandarin, the only feasible way to get such a list is to segment words from a corpus of texts written in a style that is close to what the TTS system aims to model, and then to provide a pronunciation field for each entry in the list. In other words, word segmentation is a “foremost problem” not just for developing the text analysis component of a speech synthesis system, but also for getting the basic infrastructure necessary to define the units for concatenative waveform generation. This paper reports on research aimed at defining an inventory of concatenative units for Hong Kong Cantonese using the Segmentation Corpus, a lexicon of 33k words extracted from a large corpus of Cantonese newspaper texts. The corpus is described further in Section 2 after an excursus (in Section 1) on the problems posed by the Cantonese writing system. Section 3 outlines facts about Cantonese phonology relevant to choosing the concatenative unit, and Section 4 calculates the number of units that would be necessary to cover all theoretically possible syllables and sequences of syllables. The calculation is done for three models: (1) a mixed model of syllables (as in Chu & Ching, 1997) augmented with cross-syllable diphones, (2) the Law & Lee (2000) model, and (3) a diphone-only model. This section closes by describing how we used the Segmentation Corpus to discover sporadic and systematic phonotactic gaps that can be exploited to reduce the numbers for the diphone-only model. 1 The Cantonese writing system The Cantonese writing system poses unique challenges for developing online lexicons, not all of which are related to the “foremost problem” of word segmentation. These problems stem from the long and rich socio-political history of the dialect, which makes the writing system even less regular than the Mandarin one, even though Cantonese is written primarily with the same logographs (“Chinese characters”). The main problem is that each character has several readings, and the reading cannot always be determined based on the immediate context of the other characters in a multisyllabic word. For some orthographic forms, the variation is stylistic. For example, the word 支援 ‘support’ can be pronounced zi1jyun4 or zi1wun4. 4 But for other orthographic forms, the variation in pronunciation corresponds to different words, with different meanings. For example, the character sequence 正當 writes both the function word zing3dong1 ‘while’ and the content word zing3dong3 ‘proper’. Moreover, some words, such as ko1 ‘to page, telephone’, ge3 (genitive particle), and laa3 (aspect marker), have no 3 The terms “initial” and “final” here are terms adopted from traditional Chinese linguistics to refer to the syllable onset (including zero onset) and rhyme. The rhyme is simply the second demisyllable. The initial could be made to correspond more to the first demi-syllable by differentiating consonant allophones. For example, the different release spectra of [k] before front versus back vowels could be modelled in Law & Lee’s system by recording different units for the onsets of 經 ging1 and 家 gaa1. 4 We have termed this variation “stylistic” although the factors governing the choice are ill understood. The difference between these two pronunciations is not as sociolinguistically salient as the choice of a falling tone [pou53] 煲 ‘to boil’ rather than a high level tone [pou55] for 煲 ‘pot’ (contrasting with [pou55] for 煲 ‘to boil’), where the pronunciation with the falling tone for the noun has a Guangzhou flavor, but there may be sociolinguistic factors governing it. standard written form. In colloquial styles of writing, these forms may be rendered in nonstandard ways, such as using the English source word call to write ko1, or writing the particles with special characters invented for Cantonese. In more formal writing, however, such forms must be left to the reader to interpolate from a character “borrowed” from some other morpheme. For example, ge3 (genitive particle) might be written 的, which more typically writes the morpheme dik1 in muk6dik1 目的 ‘aim’, but which suggests ge3 because it also writes a genitive particle in Mandarin (de in the Pinyin romanization). Thus, 的 has a reading dik1 that is etymologically related to the Mandarin morpheme, but it also has the etymologically independent reading ge3 because Cantonese readers can read texts written in Mandarin as if they were Cantonese. This multiplicity of readings arose first of all because the Canton City standard (which is the major source for the Hong Kong standard) has distinct colloquial and literary strata reflecting the diglossia that has characterized every major regional urban center of China for more than a millennium. In modern Hong Kong Cantonese, a character can have several different colloquial readings as well. These readings reflect sound changes in progress and a history of pervasive dialect contact between the original Canton City standard and the various other Cantonese and non-Cantonese dialects and languages that have contributed to the evolution of a distinct Hong Kong standard over the last 150 years. Finally, written Hong Kong Cantonese today is in a new relationship of intense language contact with standard (Putonghua) Mandarin, with many texts such as official policy announcements composed originally in Putonghua by non-Cantonese speakers. When such texts are incorporated into Cantonese texts, a Mandarin word with no familiar cognate in modern Cantonese can be rendered either by reading each character with some cognate Cantonese morpheme for the component Mandarin one, or by translating the whole word into a more familiar, etymologically unrelated Cantonese word. Such ambiguities of reading make the task of developing online wordlists from text corpora doubly difficult, since word segmentation is only half the task. Each character string that has been identified as a word also must be associated with an alphabetic transliteration of the most likely pronunciation in the context. Without such disambiguation, there is no way to be sure that the inventory of TTS units covers most of the attested intra-word sequences. Also, if word frequency is at issue (as it must be when designing materials for eliciting the unit inventory for a single voice), token frequencies should be calculated for the (disambiguated) transliterated forms of words, and not just for the (often ambiguous) orthographic forms that have been segmented from the text. In working with the Segmentation Corpus, therefore, our first task was to examine each entry, correcting many transliterations, and adjusting frequencies when a single orthographic form writes more than one (phonological) word. 2 The Segmentation Corpus The Segmentation Corpus is an electronic database of around 33,000 Cantonese word types extracted from a 1.7 million character corpus of Hong Kong newspapers, along with a tokenized record of the text. It is described in more detail in Chan & Tang (1999) and Tang (2001). The Cantonese corpus is part of a larger database of segmented Chinese texts, including Mandarin newspapers from both the PRC and Taiwan. The three databases were created using wordsegmentation criteria developed by researchers at the Chinese Language Centre and Department of Chinese and Bilingual Studies, Hong Kong Polytechnic University. The criteria were intended to be applicable to texts in all three Chinese varieties. Therefore, phonological criteria (such as the distribution of “neutral tone” syllables in Putonghua Mandarin) were not invoked. Rather, strings of characters were segmented from the texts based on the syntax, and the resulting word’s length, semantic transparency, and collocational frequency. Therefore, the “words” segmented from the Hong Kong newspaper text database are orthographic strings of characters rather than phonological word forms. That is, a “word” is a string of Chinese characters that (1) is an independent part of speech (e.g., the noun 盒子 hap6zi2 ‘box’ is one word, because 子 cannot stand alone in this use as a nominalizing affix even though 盒 hap2 can stand alone to mean ‘box’); (2) has a meaning that is not simply a sum of its parts (e.g., 火車 is the noun fo2ce1 ‘train’, not two words 火 ‘fire’ and ‘vehicle’); (3) does not consist of more than four characters; and (4) either is listed in Xiandai Hanyu Cidian or Zhongguo Chengyu DaCiDian or meets a predetermined frequency threshold (for strings of text not listed in these two dictionaries). For our current purpose, the most useful part of the Segmentation Corpus is the wordlist proper, a file containing a separate entry for each word type identified by the segmentation criteria. Each entry has three fields: the orthographic form(s), the pronunciation(s) in Jyutping, and the token frequency in the segmented newspaper corpus. In the original corpus, the first field could have multiple entries. For example, there are two character strings, 泛濫 and 氾濫, in the entry for the word faan6laam6 ‘to flood’. However, the two readings of 支援 were not listed separately in the pronunciation field for that orthographic form (and there was only one entry for the two words written with 正當). This was because the pronunciation field was encoded originally by inserting the Jyutping transliteration of just the first reading listed in the 1990 edition of Soeng1Mou6 San1 Ci4Din2 [ 商 務 新 詞 典 Commercial Press New Word Dictionary] for each component character in the orthographic form. The Soeng1Mou6 San1 Ci4Din2 is a character-based dictionary that lists Canton City standard pronunciations using both a traditional fanqie (initial-rhyme) character-based encoding and an alphabetic encoding (in the International Phonetic Alphabet). It does list multiple readings for some characters. For example, both “[jyn4]” and “[wun4]” are listed for 援, and both “[d1]” and “[d3]” are listed for 當 . However, it does not include modern Hong Kong colloquial pronunciations if they are different from the Canton City readings. Therefore, the original entry for 盒 ‘box’ was hap6 rather than hap2, because the Soeng1Mou6 San1 Ci4Din2 recognizes only three different tones on syllables closed with [p, t, k], although Hong Kong Cantonese has five tones in contrast on such “checked” syllables. Before we could use the wordlist, therefore, we had to check the pronunciation field for each entry. The first author, a native speaker of Hong Kong Cantonese, examined each entry in order to add variant readings not originally listed (as in the entry for 支援 ‘support’) and to correct readings that did not correspond to the modern Hong Kong pronunciation (as in the entry for 盒 ‘box’). In addition, when the added variant pronunciation corresponded to an identifiably different word (as in zing3dong3 ‘proper’ versus zing3dong1 ‘while’ for 正 當 ), the entry was split in two, and all tokens of that character string in the larger corpus were examined, in order to allocate the total token count for the orthographic form to the two separate frequencies for the two different words. Approximately 90 original entries were split into separate entries by this processing. In this way, the 32,840 entries in the original word list became 33,037 entries. Once this disambiguation of orthographic forms was completed, we were able to use the wordlist to count all of the distinct units (different diphones, demi-syllables, syllables, or “rhyme-initial combinations”) that would be needed to synthesize all of the words in the Segmentation Corpus. To explain what these units are, we need to describe the phonology of Cantonese words and larger utterances. 3 Cantonese phonology The smallest possible word is a nasal ([m] or []) or vowel as a bare tone-bearing syllable nucleus, as in 五 ng5 ‘five’ and 啞 aa2 ‘dumb’. A syllable with a vowel as nucleus can also have an onset consonant, and it can have a more complex rhyme which combines the vowel with any of the three coda nasals [m], [n], and [], the two glides [j] and [w], or the three coda stops [p], [t], and [k], as in 將 zoeng1 ‘will’, 胃 wai6 ‘stomach’, and 率 leot2 ‘rate’. Tables 1 and 2 show the 11 vowels and 19 consonants of Cantonese, and Figure 1 plots representative fundamental frequency contours for each of the six tone phonemes that contrast on a syllable with an all-sonorant rhyme. If there were no phonotactic restrictions on combinations of vowels and consonants, or on combinations of rhymes and tones, there would be 673 distinct rhymes. However, not all combinations are possible. For example, tone 5 does not occur on syllables with coda stops (and tone 4 on such checked syllables is limited to onomatopoeia). Also, the mid short vowel [e] does not occur in open syllables, and in closed syllables it occurs only before [k], [], and [j], whereas [i:] occurs in closed syllables only before [p], [t], [m], [n], and [w]. The Jyutping transliteration system takes advantage of this kind of complementary distribution to limit the number of vowel symbols. Thus “i” is used to transcribe both the short mid vowel [e] in the rhymes [ek], [e], and [ej], and the high vowel [i:] in the rhymes [i:], [i:t], [i:p], [i:m], [i:n], and [i:w]. Ignoring tone, there are 54 rhyme types distinguished in standard usage of Jyutping. Canonical forms of longer words can be described to a first approximation as strings of syllables, each of which has one of the same basic shapes just described for monosyllabic words. That is, Cantonese does not have the contrast between stressed (tone-bearing) syllables and reduced (“neutral tone”) syllables that gives the listener some clues to the (phonological) word segmentation of a standard Mandarin utterance. There are no forms such as Mandarin kuai4zi ‘chopsticks’, which is undeniably a single word because of the unstressed (toneless) nature of the affix syllable -zi. Even “particles” such as the aspect marker and the pragmatic morphemes that mark certain speech acts bear lexical tone. Therefore, we do not need to take syllable “stress” into account in specifying the inventory of concatenative units for synthesizing word of Cantonese (as we would have to do for Putonghua Mandarin), and we can focus instead on the effects of position in the word or phrase. One of these effects is the fusion of coda [p] with initial [h] in words such as 集合 zaap6 hap6. In fast speech or even in normal rates for casual speaking styles, such segmental sandhi effects can be much more extreme. Consonants can be reduced or deleted, with the abutting vowels fused together, as in the pronunciation [jy:n21la:212] for the phrase 原來係 jyun4loi4 hai6 ‘was’ (with the verb cliticized onto the preceding tense adverb), illustrated in Figure 2. Some very frequent words or phrases have even become lexicalized as fused forms. For example, 咩 me1 ‘what’ is etymologically a fusion of a disyllabic word mat1je5. In the careful speaking style which we aim to model initially, such extreme fusions as [jy:n21la:212] for jyun4loi4 hai6 are less common, and we will worry for now only about modelling lexicalized fusions and less drastic sandhi effects, such as fusion of coda [p] with following [h]. front round i: y: e : : central back round u: o : a: high mid (short) mid (long) low vowels Table 1. Vowels of Cantonese. labial dental ph p f m th, tsh t, ts s n, l palatal velar kh k labio velar khw kw h w Table 2. Consonants of Cantonese. 300 tone 1 tone 2 tone 3 tone 6 tone 5 tone 4 100 0 Time (s) Figure 1. Fundamental frequency contours for utterances of six monosyllabic words containing the segment sequence [wj] and contrasting in tone. The numbers to the left identify the starting points of contours for the four level tones (in black) and the numbers to the right identify the endpoints of the two level tones (in grey). (The 0.7 where the speaker breaks into creaky voice.) discontinuities in the F0 contour for waai4 show o5 jyun4 loi4 hai6 223 221 221 222 wai3 333 HL% laa4+6 250 100 0 1.56807 Time (s) Figure 2. Spectrogram with F0 track superimposed for an utterance of the sentence o5 jyun4loi4 hai6 wai3 ‘Oh, I get it. It was the character 慰!’ (The context is a dictation task.) The labelling window above the signal view shows a partial transcription in the annotation conventions proposed by Wong, Chan & Beckman (in press), with a syllable-by-syllable Jyutping transliteration (top tier), a transcription of the (canonical) lexical tones and boundary tones, and a phonetic transcription of fused forms (lowest tier). The HL% boundary tone is a pragmatic morpheme, which we have translated with the ‘Oh, I get it.’ phrase. Jyutping Chu & Ching Law & Lee diphones 經 濟 ging1 zai3 zai 學 家 hok6 gaa1 ging i$ho hok gaa# #g ing$z ai$h ok$g aa# #gi ing ng$z zai ai i$ho ok ga aa# Table 3. The string of basic units and exceptional units (underlined) that would be needed to utterance of the word ‘economist’ in each of the three models. Figure 2 also illustrates another positional effect that will affect the synthesis model. The final syllable in this utterance is sustained, to be longer than the four syllables combined. It also bears two extra tone targets for the HL% boundary tone. (See Wong, Chan & Beckman, in press, for a discussion of these utterance-level effects.) Phrase-final lengthening is not usually so extreme in read speech, and the inventory of boundary tones is more limited. However, there will be sufficient final lengthening to warrant the recording of separate units for (the rhymes of) final syllables. These two positional effects increase the inventory of units, albeit in different ways depending on the choice of “basic” unit. 4 Counting different unit types Table 3 illustrates the three models using the word 經 濟 學 家 ging1zai3hok6gaa1 ‘economist’. The first model is an adaptation of basic+ exc. units (w/tone) 1042 + 1449 1704 1061 synthesize an Chu & Ching (1997). The basic concatenative unit is the syllable, with some cross-syllable diphones added to capture word-internal sandhi. The second model is that of Law & Lee (2000), in which the basic unit is the syllable turned inside out — i.e. a combination of the rhyme of the preceding syllable plus the initial of the following syllable. The third model uses something like the traditional diphone, but differentiating codas from onset consonants. This model would have an advantage over Law & Lee’s cross-syllable final-initial combination model in that spectral continuity between the initial and rhyme is captured in the CV diphones (such as #gi and zai). Similarly, the diphones capture the dependency between the quality of the [h] and that of the following vowel (i.e., one records separate cross-syllable diphones for i$ho, i$hi, i$haa, and so on). The last column in Table 3 lists the theoretically possible number of basic units and (for the first model) of exceptional units that would have to be added to model the positional effects. For example, the basic Chu & Ching (1997) model is to string together full syllables. Multiplying the number of onsets by the number of rhymes, we get 1042 syllable types. (The exceptional units are an extra full set of syllables recorded in utterance final position and a set of cross-syllable diphones added to capture crosssyllabic sandhi effects word-internally.) Note that these counts do not take tone into consideration. However, since every syllable bears a (full) tone, and since tones are not reduced in running speech, recording different units for rhymes with different tones should improve the naturalness of the synthesis if fundamental frequency is modified using standard time-domain techniques such as PSOLA. (Even if better methods for pitch manipulation become available, the contour for tone 4 in Figure 1 indicates that voice quality can be affected by the tone specification, so it may still be necessary to record separate units for each tone.) Naturalness may also require different cross-syllabic units for different tone sequences when sonorant segments abut at syllable edges (so as to insure tonal continuity). Of course, when tone specification is taken into account, the number of necessary units grows substantially, particularly in the crosssyllabic final-initial model of Law & Lee (2000) and the diphone-only model. When we take tone into consideration in the Law & Lee model, the number of units becomes 9516. In the diphoneonly model, it becomes 16,200. This is likely to be too many, because few speakers can maintain consistent voice quality for the time that would be required. But when we count only those types that are attested in the words of the Segmentation Corpus, the total number is 4317. If each diphone were recorded in a disyllabic carrier word, a Cantonese speaker could speak all of the words to make a new voice in a single recording session. (For comparison, the number of attested diphones ignoring tone is 743.) Conclusion We have shown one way of using a segmented database to inform the design of a unit inventory for TTS. In its original form the Segmentation Corpus was not ideal for our purpose, because the focus of that project was the development of segmentation criteria that could be applied to cognate words across dialects of Chinese, rather than the development of a word list with dialectspecific pronunciations. We augmented the Segmentation Corpus with transliterations that would let us predict more accurately the pronunciation that a Cantonese speaker adopting a careful speaking style would be likely to produce for a character sequence. Judgements about the phonology of Cantonese, in combination with the new wordlist, and the associated word frequency data, can be used to assess the costs and likely benefits of different strategies for unit selection in Cantonese TTS. In particular, we present data indicating the feasibility of a new diphone selection strategy that finesses some of the problems in modelling the interactions between tone and segmental identity. It remains to be demonstrated that this strategy can actually deliver the results which it appears to promise. This is our future work. Acknowledgements This work was supported by a grant from the University Grants Committee of Hong Kong to Y. S. Cheung and an SBC/Ameritech Faculty Research Award to C. Brew and M. Beckman. References Chan S. D. and Tang Z. X. (1999) Quantitative Analysis of Lexical Distribution in Different Chinese Communities in the 1990’s. Yuyan Wenzi Yingyong [Applied Linguistics], No.3, 10-18 Chu M. and Ching P. C. (1997) A Cantonese synthesizer based on TD-PSOLA method. Proceedings of the 1997 International Symposium on Multimedia Information Processing. Academia Sinica, Taipei, Taiwan, Dec. 1997. Law K. M. and Lee Tan (2000) Using cross-syllable units for Cantonese speech synthesis. Proceedings of the 2000 International Conference on Spoken Language Processing, Beijing, China, Oct. 2000. Tang Z. X. (2001) Synchronic Situation and Development of Contemporary Chinese Lexis. Fudan University Press. Wong W. Y. P., Chan M. K-M., and Beckman M. E. (in press) An Autosegmental-Metrical analysis and prosodic conventions for Cantonese. To appear in S-A. Jun, ed. Prosodic Models and Transcription: Towards Prosodic Typology. Oxford University Press.