Report of Phonetic Research 2008

Multi-accent and Multi-lingual Speech Corpus*

WANG Xia
Nokia Research Center, Beijing
Graduate School of Chinese Academy of Social Sciences
Beijing 100176, China
xia.s.wang@nokia.com

LI Aijun, XIONG Ziyu, YIN Zhigang
Institute of Linguistics, Chinese Academy of Social Sciences
#5, Jianguomennei Dajie, Beijing 100732, China
liaj@cass.org.cn

Abstract
This paper introduces a Multi-accent and Multi-lingual Speech Corpus covering 3 Chinese dialectal regions (Guangzhou for the Cantonese (Yue) dialect, Xiamen for the Min dialect and Beijing for Standard Chinese) as well as western American regions. For Chinese dialectal speakers, their dialect, Mandarin Chinese and English speech data are collected; for Beijing speakers, Standard Chinese and English speech data are collected; for American speakers, only their native speech data are collected. The aim of collecting and annotating this corpus is to support phonetic studies of accented Mandarin and accented English and to benefit speech recognition, speech generation and Computer Assisted Language Learning (CALL) systems.

Keywords: Multi-accent, Multi-lingual, Speech Corpus, Chinese dialects, English

1 Introduction
Spoken Chinese comprises many regional varieties, called dialects. There are 9 dialectal regions in China, i.e. Guan (Mandarin), Jin, Wu (Shanghainese), Hui, Xiang, Gan, Kejia, Yue (Cantonese) and Min [1]. People from different dialectal areas might not be able to communicate with each other, simply because the differences among the dialects are so significant. Standard Mandarin, or Putonghua, serves as a common basis for communication. Most people in China are bilingual Chinese speakers, i.e. they speak their native dialect and Mandarin. Although many people can speak Mandarin, they speak it with different accents, depending on how well they have mastered the language. The Mandarin they speak is always affected by their native dialects phonetically, lexically and syntactically. Researchers in Chinese speech recognition have encountered huge problems because, among the 1.3 billion Chinese, very few speak Standard Mandarin. Most people speak Mandarin as a second language, and the Mandarin they speak is unavoidably influenced by their first language, whatever dialect that is.

Similarly, when Chinese people speak English, they speak with an accent too. Their English accent is also affected by their native language. This corpus is designed for comparative studies of phonetic aspects between Chinese dialects, regional Mandarin, standard Mandarin, EFL (English as a foreign language) learners' English and native English, which will benefit speech recognition and synthesis, pronunciation evaluation and language teaching.

The speech corpus includes 4 subsets:
(1) Cantonese accented Mandarin speech and Cantonese dialect;
(2) Min accented Mandarin speech and Min dialect;
(3) Standard Chinese;
(4) Chinese accented English by EFL learners and American English speech: the learners are from the above 3 subsets, and the native American English speakers are from the United States of America.

2 Recording setup
The recordings for subsets 1 and 2 were carried out in a normal quiet office environment, using a laptop with an M-Audio MobilePre USB sound card and a condenser microphone. The rest of the database was recorded in a soundproof recording studio at the Institute of Linguistics, CASS. In-house recording software was used for isolated utterances in read speech style, while CoolEdit was used for spontaneous and dialogue speech.

3 Corpus Description

3.1 Subset 1: Cantonese and Yue-accented Mandarin Speech Corpus
We recruited 30 university students in Guangzhou who are native Cantonese speakers and are also fluent in Putonghua (accented Mandarin Chinese) and English (English is not their major).
The speakers should have moderate proficiency levels in Putonghua and English, although their pronunciation may have light to medium accents.

We recorded 3 'languages' for each speaker: their native dialect, Mandarin Chinese, and English. Table 1 describes the Yue dialect material for each speaker and Table 2 describes the material for their accented Mandarin Chinese. For English, please refer to Table 4 in section 3.4.

The materials are divided into 10 prompt sheets for read speech and 30 topics for spontaneous speech. Each speaker reads one prompt sheet presented by the recording application, and then picks a topic for 3-5 minutes of free talk.

No.  Content                   Detail
1    Syllable Finals           About 72 items
2    Monosyllables             About 239 items
3    Phonetic Rich Words       About 188 items, including the following:
                               A. phrases (22)
                               B. words with all tonal combinations (35)
                               C. words with similar tones (48)
                               D. di-syllabic words covering inter-syllabic di-phones (83)
4    Phonetic Rich Sentences   About 241 items, including 35 single sentences, 35 complex sentences, and 172 balanced sentences
5    Discourses                2 items, including "The North Wind and the Sun" and one discourse from 5 dialectal discourse candidates
6    Spontaneous Speech        3-5 minutes' speech on one topic picked from 30 pre-defined topics
Table 1: Cantonese (Yue dialect) sub-corpus design for each speaker

No.  Content                   Detail
1    Initials and Finals       About 59 items
2    Monosyllables             About 149 items
3    Phonetic Rich Words       About 550 items, including the following:
                               A. retroflexed words (36)
                               B. tri-syllabic words (64)
                               C. quadra-syllabic words (76)
                               D. di-syllabic words ending with a neutral tone (32)
                               E. di-syllabic words covering inter-syllabic diphones (322)
                               F. tri-syllabic words with neutral tone in the middle (20)
4    Phonetic Rich Sentences   About 150 items, including single sentences and complex sentences
5    Discourses                3 items, including "The North Wind and the Sun" and 2 discourses selected from 60 discourses used in the Mandarin Ability Test
6    Spontaneous Speech        3-5 minutes' speech on one topic picked from 30 pre-defined topics
Table 2: Mandarin Chinese (PuTongHua) sub-corpus design for each speaker

Annotation
Phonetic annotation has been made for all recordings, including segmental and prosodic annotation. Figure 1 shows an annotation example for accented Mandarin. The example utterance is "jin4 xing2 ge4 zhong3 cao3 yuan2 shang4 de0 qing4 dian3 huo2 dong4." (进行各种草原上的庆典活动; in English, "All kinds of traditional celebrations are held on the prairies."). As shown in Fig. 1, the annotation tiers for accented Mandarin Chinese include:
HZ: Chinese characters;
F: phoneme boundaries. "#" indicates a wrong pronunciation due to the influence of a particular local accent; the phoneme in the brackets is the real pronunciation, and the one outside is the canonical one;
PinYin: Pinyin transcription and syllable boundaries;
BI: break index, including pauses for prosodic word (1), minor phrase (2) and intonation phrase (3);
ST: stress tier, 1-3 corresponding to different stress levels;
MIS: paralinguistic phenomena, such as BR (breathing).

3.2 Subset 2: Min and Min-accented Mandarin Speech Corpus
We also recruited 30 university students in Xiamen with the Min dialect as their mother tongue. These speakers should likewise have moderate proficiency levels in Putonghua and English. Each speaker was asked to record 3 languages, i.e. Min, Mandarin and English. Table 3 describes the corpus design of the Min dialect for each speaker. The Mandarin Chinese material described in Table 2 and the English material described in Table 4 were used to record their Mandarin and English speech.
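As an illustration of how the PinYin tier can be processed, the following minimal Python sketch splits the example utterance of Fig. 1 into syllable and tone pairs. It assumes that each label is a lowercase Pinyin syllable followed by a single tone digit, with 0 marking the neutral tone as in "de0"; the name parse_pinyin_tier is hypothetical and is not part of the corpus tools.

```python
import re

# Assumed label shape: lowercase Pinyin letters followed by one tone digit,
# where 0 marks the neutral tone (as in "de0" in the example utterance).
SYLLABLE_RE = re.compile(r"^([a-z]+)([0-4])$")


def parse_pinyin_tier(text):
    """Split a space-separated Pinyin string into (syllable, tone) pairs."""
    pairs = []
    for token in text.strip().rstrip(".").split():
        match = SYLLABLE_RE.match(token)
        if match is None:
            # Fail loudly on labels that do not fit the assumed shape.
            raise ValueError(f"unexpected Pinyin label: {token!r}")
        pairs.append((match.group(1), int(match.group(2))))
    return pairs


# Example utterance quoted for Fig. 1.
EXAMPLE = "jin4 xing2 ge4 zhong3 cao3 yuan2 shang4 de0 qing4 dian3 huo2 dong4."

syllables = parse_pinyin_tier(EXAMPLE)
# Syllables carrying the neutral tone.
neutral = [syllable for syllable, tone in syllables if tone == 0]
```

Keeping the tone as a separate integer makes it straightforward to compare tone sequences across speakers, for instance when looking at neutral-tone words in accented Mandarin.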
Fig. 1: Annotation for accented Mandarin speech

Annotation
The annotation process is similar to that described in section 3.1. The annotation work is still ongoing.

No.  Content                   Detail
1    Syllable Finals           About 82 items
2    Monosyllables             About 200 items
3    Phonetic Rich Words       About 260 items, of which 160 are common to all speakers
4    Phonetic Rich Sentences   About 270 items, of which 61 are common to all speakers
5    Discourses                2 items, including "The North Wind and the Sun" and one discourse from 5 dialectal discourse candidates
Table 3: Corpus design for Min dialect (XiaMen)

3.3 Subset 3: Standard Chinese Speech Corpus
26 speakers from Beijing or the North China area participated in the recording. Their mother tongue is Standard Chinese and their English is at a moderate level. Both Standard Chinese and English were recorded for each speaker. Table 2 describes the Chinese material and Table 4 describes the English material used in the recording. Phonetic annotation has been made according to the process defined in 3.1.

3.4 Subset 4: Native American English and Chinese EFL learners' English Speech Corpus
This subset is designed for comparative phonetic research on segmental and prosodic features between native American English and the English of Chinese EFL learners, which will benefit automatic speech recognition for Chinese-accented English speech. The corpus is composed of two parts, i.e. English speech from native American English speakers, and accented English speech from native Chinese speakers as EFL learners. Detailed information can be found in Table 4.

No.  Content                                 Detail
1    English Alphabet                        26 items
2    Phonetically Rich Words and Sentences   145 words and 182 sentences
3    Phonetically Balanced Sentences         237 items with various sentence types
4    Functional Sentences                    400 items, including 250 questions, 90 commands, and 60 statements. Questions are composed of 156 sets of yes-no questions, 69 wh-questions, 7 alternative questions, 9 tag questions and 9 echo questions. Commands are composed of 80 sentences and 10 dialogues.
5    Discourses                              2 discourses: Kung Fu Panda scripts and a 200-word essay
Table 4: English sub-corpus design

No. 4 and No. 5 in Table 4 are designed for two specific comparative prosodic studies. One concerns English questions, and the other concerns English commands.

As for the research on questions, following Intonation of Colloquial English (O'Connor and Arnold, 1961), the test materials consist of three parts: yes-no questions, wh-questions and other questions. Wells points out that a yes-no question is a query about polarity (Wells, 2006). Moreover, statements and declarative questions start to diverge from the stressed syllable of the first content word (Liu and Xu, 2007). In order to find out the effect of focus on F0 contours, all focus locations (initial, medial, final) are included for yes-no questions and wh-questions. In addition, the sentences with focus in the medial and final positions are classified into two kinds: with or without content words before the focus word. For example, consider the following sentences with focus at medial position:

Dialogue 1
A: I will return it this morning.
B: Can I COUNT on that?

Dialogue 2
A: Look at this coat.
B: Haven't they made a MESS of it?

There is a verb "made" before "mess" in dialogue 2, while there is no content word before "count" in dialogue 1. The stress position of the focus word is also considered (word final and word non-final). However, focus is not considered in the last part (other questions), because the focus position is usually fixed in those questions.

As for the research on English commands, English commands can be divided into different types according to different degrees of command mood. This part was therefore included in the recording in order to give a complete view of commands. Furthermore, dialogues with commands were added to obtain more natural expression.
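The item counts given for the functional sentences (No. 4 in Table 4) can be cross-checked with a few lines of Python. This is only a sketch: it assumes that the "156 sets of yes-no questions" count as 156 items, and the dictionary keys and function name are illustrative labels rather than identifiers used in the corpus.

```python
# Item counts as stated for row No. 4 of Table 4 (functional sentences).
# Assumption: each "set" of yes-no questions is counted as one item.
QUESTIONS = {"yes-no": 156, "wh": 69, "alternative": 7, "tag": 9, "echo": 9}
COMMANDS = {"sentences": 80, "dialogues": 10}
STATEMENTS = 60


def functional_sentence_totals():
    """Sum the per-type counts into the category totals and the grand total."""
    questions = sum(QUESTIONS.values())
    commands = sum(COMMANDS.values())
    return {
        "questions": questions,
        "commands": commands,
        "statements": STATEMENTS,
        "total": questions + commands + STATEMENTS,
    }


totals = functional_sentence_totals()
```

Under this reading the sub-counts add up to the 250 questions, 90 commands and 400 items stated in Table 4.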
Moreover, discourses were added as a supplement in this part so that the scope of prosodic research can be extended from sentences to whole passages.

Recording
The number of speakers and the recording materials vary according to the research purpose. Details are listed in Table 5. The first column of Table 5 shows the recording regions, the second column gives the number of speakers in each region, and the last column lists the materials recorded, numbered as in Table 4.

Regions     # Speakers   Materials
Guangzhou   30           No. 1 and 2
Xiamen      30           No. 1 and 2
Beijing     26           13 speakers: No. 3; 13 speakers: No. 3, 4, 5
USA         26           13 speakers: No. 3; 13 speakers: No. 3, 4, 5
Table 5: Speaker information

Annotation
Data annotation was conducted using Praat. Speech was first labeled by automatic segmentation software, and then the syllable boundaries were corrected manually. As shown in Fig. 2, the annotation tiers for English are:
PH: boundaries of phonemes. "*" indicates mispronunciation, and the phoneme in brackets is the standard one. "-" indicates a missing phoneme, and the phoneme in front of "-" is the missing one;
WD: boundaries of each word;
BI: break index, including boundaries for minor phrase (3) and intonation phrase (4);
ST: stress tier, 3 and 4 corresponding to different stress levels;
BT: boundary tones.

Fig. 2: Annotation for English uttered by an EFL speaker: "Americans hear word of massive awards granted to consumers."

4 Conclusion
This multi-lingual, multi-accent speech corpus consists of Chinese dialects, standard Mandarin, accented Mandarin, native English and English spoken by Chinese speakers. It provides a good basis for comparative studies of standard Mandarin vs. accented Mandarin and of native English vs. non-native English, from both segmental and prosodic perspectives. The research results would benefit automatic speech recognition for non-native speech as well as personalized speech synthesis.
References
[1] Chinese Academy of Social Sciences, Language Atlas of China, Longman Group (Far East), 1987, 1990.
[2] J. C. Wells, 2006, English Intonation, Cambridge University Press, London.
[3] J. D. O'Connor and G. F. Arnold, 1961, Intonation of Colloquial English, Longman Publishers, London.
[4] Liu, F. and Xu, Yi, 2007, Interaction of word stress, focus, and sentence type in English. Journal of the Acoustical Society of America 121, Pt. 2, 3199.
[5] http://www.fon.hum.uva.nl/praat

* Published in Proc. of O-COCOSDA2008, Japan