
Report of Phonetic Research 2008
Multi-accent and Multi-lingual Speech Corpus*
WANG Xia
Nokia Research Center, Beijing
Graduate School of Chinese Academy of Social Sciences
Beijing 100176, China
xia.s.wang@nokia.com
LI Aijun, XIONG Ziyu, YIN Zhigang
Institute of Linguistics, Chinese Academy of Social Sciences
#5, Jianguomennei Dajie,
Beijing 100732, China
liaj@cass.org.cn
Abstract
This paper introduces a Multi-accent and Multi-lingual Speech Corpus
for 3 Chinese dialectal regions and western American regions:
Guangzhou for the Cantonese (Yue) dialect, Xiamen for the Min dialect
and Beijing for Standard Chinese. For Chinese dialectal speakers,
their dialect, Mandarin Chinese and English speech data are collected;
for Beijing speakers, Standard Chinese and English speech data are
collected; for American speakers, only their native speech data are
collected. The aim of collecting and annotating this corpus is to
support phonetic studies of accented Mandarin and accented English and
to benefit speech recognition, speech generation and Computer Assisted
Language Learning (CALL) systems.
Keywords: Multi-accent, Multi-lingual, Speech Corpus, Chinese
dialects, English
1 Introduction
Spoken Chinese comprises many regional varieties, called dialects.
There are 9 dialectal regions in China, i.e. Guan (Mandarin), Jin, Wu
(Shanghainese), Hui, Xiang, Gan, Kejia, Yue (Cantonese) and Min [1].
People from different dialectal areas might not be able to communicate
with each other simply because the differences among the dialects are
so significant.
Standard Mandarin, or Putonghua, is a good choice as a common basis.
Most people in China are bilingual, speaking their native dialect and
Mandarin. Although many people can speak Mandarin, they speak it with
different accents, depending on how well they have mastered the
language. The Mandarin they speak is always affected by their native
dialects phonetically, lexically and syntactically.
Researchers in Chinese speech recognition have encountered huge
problems because, among the 1.3 billion Chinese, very few speak
Standard Mandarin. Most people speak Mandarin as a second language,
and the Mandarin they speak is unavoidably influenced by their first
language, whatever dialect that is.
Similarly, when Chinese people speak English, they speak with an
accent too. Their English accent is also affected by their native
language.
This corpus is designed for comparative studies of phonetic aspects
of Chinese dialects, regional Mandarin, Standard Mandarin, the English
of EFL (English as a foreign language) learners and native English,
which will benefit speech recognition and synthesis, pronunciation
evaluation and language teaching.
The speech corpus includes 4 subsets:
(1) Cantonese accented Mandarin speech and Cantonese dialect;
(2) Min accented Mandarin speech and Min dialect;
(3) Standard Chinese;
(4) Chinese accented English by EFL learners and American English
speech: the learners are from the above 3 subsets. Native American
English speakers are from the United States of America.
2 Recording setup
Subsets 1 and 2 were recorded in an ordinary quiet office
environment, using a laptop with an M-Audio MobilePre USB sound card
and a condenser microphone. The rest of the database was recorded in a
soundproof recording studio at the Institute of Linguistics, CASS.
In-house recording software was used for isolated utterances in read
speech style, while CoolEdit was used for spontaneous and dialogue
speech.
3 Corpus Description
3.1 Subset 1: Cantonese and Yue-accented Mandarin Speech Corpus
We recruited 30 university students in Guangzhou who are native
Cantonese speakers and are also fluent in Putonghua (accented Mandarin
Chinese) and English (English is not their major). The speakers were
required to have moderate proficiency in Putonghua and English,
although their pronunciation may carry light to medium accents.
We recorded 3 ‘languages’ for each speaker: their native dialect,
Mandarin Chinese, and English. Table 1 describes the Yue dialect
material for each speaker and Table 2 describes the material for their
accented Mandarin Chinese. For English, please refer to Table 4 in
Section 3.4.
The materials are divided into 10 prompt sheets for read speech and
30 topics for spontaneous speech. Each speaker reads one prompt sheet
presented by the recording application, and then picks a topic for a
free talk of 3-5 minutes.
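The prompt-sheet scheme described above (10 read-speech sheets, 30 spontaneous-speech topics, one sheet and one topic per speaker) can be sketched as a small session planner. The paper does not say how sheets were allocated to speakers, so the round-robin rule below is purely an assumption, and the random topic draw only stands in for the speaker's own choice; the function names are hypothetical.

```python
import random

NUM_SHEETS = 10   # read-speech prompt sheets (from the text)
NUM_TOPICS = 30   # spontaneous-speech topics (from the text)


def assign_session(speaker_index, rng):
    """Return a (sheet, topic) pair for one speaker, both 1-based.

    Assumption: sheets are handed out round-robin. The topic is drawn
    at random only as a stand-in for the speaker picking one.
    """
    sheet = speaker_index % NUM_SHEETS + 1
    topic = rng.randint(1, NUM_TOPICS)  # inclusive on both ends
    return sheet, topic


def plan_sessions(num_speakers, seed=0):
    """Plan one recording session per speaker, reproducibly."""
    rng = random.Random(seed)
    return [assign_session(i, rng) for i in range(num_speakers)]


sessions = plan_sessions(30)  # 30 speakers, as in Subset 1
```

Under this assumed rule, 30 speakers and 10 sheets would give three readers per sheet; a different allocation policy would only change `assign_session`.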
No.  Content                  Detail
1    Syllable Finals          About 72 items
2    Monosyllables            About 239 items
3    Phonetic Rich Words      About 188 items, including the following:
                              A. phrases (22)
                              B. words with all tonal combinations (35)
                              C. words with similar tones (48)
                              D. di-syllabic words covering inter-syllabic
                                 di-phones (83)
4    Phonetic Rich Sentences  About 241 items, including 35 single sentences,
                              35 complex sentences, and 172 balanced sentences
5    Discourses               2 items, including “The North Wind and the Sun”
                              and one discourse from 5 dialectal discourse
                              candidates
6    Spontaneous Speech       3-5 minutes’ speech on one topic picked from 30
                              pre-defined topics
Table 1: Cantonese (Yue dialect) sub-corpus design for each speaker
No.  Content                  Detail
1    Initials and Finals      About 59 items
2    Monosyllables            About 149 items
3    Phonetic Rich Words      About 550 items, including the following:
                              A. retroflexed words (36)
                              B. tri-syllabic words (64)
                              C. quadra-syllabic words (76)
                              D. di-syllabic words ending with a neutral
                                 tone (32)
                              E. di-syllabic words covering inter-syllabic
                                 diphones (322)
                              F. tri-syllabic words with neutral tone in the
                                 middle (20)
4    Phonetic Rich Sentences  About 150 items, including single sentences and
                              complex sentences
5    Discourses               3 items, including “The North Wind and the Sun”
                              and 2 discourses selected from 60 discourses
                              used in Mandarin Ability Test
6    Spontaneous Speech       3-5 minutes’ speech on one topic picked from 30
                              pre-defined topics
Table 2: Mandarin Chinese (PuTongHua) sub-corpus design for each speaker
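As a quick consistency check on Table 2, the sub-category counts of the phonetic rich words can be summed and compared with the stated total, and the read-speech items can be tallied. The sketch below simply copies the figures from the table (all of which are approximate, "about N items"); the dictionary names are illustrative and not part of the corpus.

```python
# Item counts copied from Table 2; every figure is approximate.
MANDARIN_DESIGN = {
    "Initials and Finals": 59,
    "Monosyllables": 149,
    "Phonetic Rich Words": 550,
    "Phonetic Rich Sentences": 150,
    "Discourses": 3,
}

# Sub-categories A-F of the phonetic rich words.
RICH_WORD_SUBTYPES = {
    "retroflexed words": 36,
    "tri-syllabic words": 64,
    "quadra-syllabic words": 76,
    "di-syllabic words ending with a neutral tone": 32,
    "di-syllabic words covering inter-syllabic diphones": 322,
    "tri-syllabic words with neutral tone in the middle": 20,
}

subtype_total = sum(RICH_WORD_SUBTYPES.values())
read_items_total = sum(MANDARIN_DESIGN.values())

print(subtype_total)     # 550, matching the stated total
print(read_items_total)  # 911 read-speech items listed in the design
```

The same check applied to Table 1 gives 22 + 35 + 48 + 83 = 188 for the Cantonese phonetic rich words, again matching the stated total.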
Annotation
Phonetic annotation has been made for all recordings, including
segmental and prosodic annotation. Figure 1 shows an annotation
example for accented Mandarin.
As shown in Fig. 1, annotation tiers for accented Mandarin Chinese
include:
• HZ: Chinese characters;
• F: phoneme boundaries. “#” indicates a wrong pronunciation due to
  the influence of a particular local accent; the phoneme in the
  brackets is the real pronunciation, and the one outside is the
  canonical one;
• PinYin: Pinyin transcription and syllable boundaries;
• BI: break index, including pauses for prosodic word (1), minor
  phrase (2) and intonation phrase (3);
• ST: stress tier, with 1-3 corresponding to different stress levels;
• MIS: paralinguistic phenomena, such as BR (breathing).
Fig. 1: Annotation for accented Mandarin speech: “jin4 xing2 ge4
zhong3 cao3 yuan2 shang4 de0 qing4 dian3 huo2 dong4.” (进行各种草原上的
庆典活动; in English, “All kinds of traditional celebrations are held
on the prairies.”)
3.2 Subset 2: Min and Min-accented Mandarin Speech Corpus
We also recruited 30 university students in Xiamen whose mother
tongue is the Min dialect. The speakers were required to have moderate
proficiency in Putonghua and English as well.
Each speaker was asked to record 3 languages, i.e. Min, Mandarin and
English.
Table 3 describes the Min dialect corpus design for each speaker. The
Mandarin Chinese material described in Table 2 and the English
material described in Table 4 were used to record their Mandarin and
English speech.
No.  Content                  Detail
1    Syllable Finals          About 82 items
2    Monosyllables            About 200 items
3    Phonetic Rich Words      About 260 items, of which 160 are common to
                              all speakers
4    Phonetic Rich Sentences  About 270 items, of which 61 are common to
                              all speakers
5    Discourses               2 items, including “The North Wind and the Sun”
                              and one discourse from 5 dialectal discourse
                              candidates
Table 3: Corpus Design for Min dialect (XiaMen)
Annotation
The annotation process is similar to that described in Section 3.1.
The annotation work is still ongoing.
3.3 Subset 3: Standard Chinese Speech Corpus
26 speakers from Beijing or the North China area participated in the
recording. Their mother tongue is Standard Chinese and their English
is at a moderate level.
Both Standard Chinese and English were recorded for each speaker.
Table 2 describes the Chinese material and Table 4 describes the
English material used in the recording.
Phonetic annotation has been made according to the process defined in
Section 3.1.
3.4 Subset 4: Native American English and Chinese EFL learners’
English Speech Corpus
This subset is designed for comparative phonetic research on
segmental and prosodic features between native American English and
the English of Chinese EFL learners, which will benefit automatic
speech recognition for Chinese-accented English speech.
This corpus is composed of two parts, i.e. English speech from native
American English speakers, and accented English speech from native
Chinese speakers as EFL learners. Detailed information can be found in
Table 4.
No.  Content                  Detail
1    English Alphabet         26 items
2    Phonetically Rich Words  145 words and 182 sentences
     and Sentences
3    Phonetically Balanced    237 items with various sentence types
     Sentences
4    Functional Sentences     400 items, including 250 questions, 90
                              commands, and 60 statements. Questions are
                              composed of 156 sets of yes-no questions, 69
                              wh-questions, 7 alternative questions, 9 tag
                              questions and 9 echo questions. Commands are
                              composed of 80 sentences and 10 dialogues.
5    Discourses               2 discourses: Kung Fu Panda scripts and a
                              200-word essay
Table 4: English sub-corpus design
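The breakdown of the functional sentences (No. 4 in Table 4) can be cross-checked with a few lines of arithmetic. The sketch below copies the figures from the table; the variable names are illustrative, and counting the "156 sets" of yes-no questions as 156 items is an assumption made only so the parts can be summed.

```python
# Question sub-types from Table 4, No. 4.
# Assumption: the "156 sets" of yes-no questions count as 156 items.
QUESTIONS = {
    "yes-no": 156,
    "wh": 69,
    "alternative": 7,
    "tag": 9,
    "echo": 9,
}

# Command material from Table 4, No. 4.
COMMANDS = {"sentences": 80, "dialogues": 10}

STATEMENTS = 60

questions_total = sum(QUESTIONS.values())
commands_total = sum(COMMANDS.values())
functional_total = questions_total + commands_total + STATEMENTS

print(questions_total, commands_total, functional_total)  # 250 90 400
```

Under that assumption, the parts add up to the stated 250 questions, 90 commands and 400 functional sentences.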
No. 4 and No. 5 in Table 4 are designed for two specific comparative
prosodic studies. One is about English questions, and the other is
about English commands.
As for the research on questions, following Intonation of Colloquial
English (O’Connor and Arnold, 1961), the test materials are composed
of three parts: yes-no questions, wh-questions and other questions.
Wells points out that a yes-no question is a query about polarity
(Wells, 2006). Moreover, Liu and Xu point out that statements and
declarative questions start to diverge from the stressed syllable of
the first content word (Liu and Xu, 2007). In order to find out the
effect of focus on F0 contours, all focus locations (initial, medial,
final) are included for yes-no questions and wh-questions. In
addition, the sentences with focus in the medial and final positions
are classified into two kinds: with or without content words before
the focus word. For example, consider these sentences with focus in
the medial position:
Dialogue 1 A: I will return it this morning.
           B: can I COUNT on that?
Dialogue 2 A: look at this coat.
           B: Haven’t they made a MESS of it?
There is a verb “made” before “mess” in Dialogue 2, while there is no
content word before “count” in Dialogue 1. What’s more, the stress
position within the focus word is also considered (word-final and
word-non-final).
However, focus is not considered in the last part, because it is
usually fixed in those questions.
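The factors described above for yes-no and wh-questions (focus location, presence of a content word before the focus word, and stress position within the focus word) can be laid out as a design grid. The sketch below enumerates the combinations under the assumption that the factors are fully crossed; the paper does not state that every cell is filled or how many sentences fall in each cell, and all names here are illustrative.

```python
from itertools import product

QUESTION_TYPES = ("yes-no", "wh")
FOCUS_LOCATIONS = ("initial", "medial", "final")
STRESS_POSITIONS = ("word-final", "word-non-final")


def focus_conditions():
    """List (question type, focus location, pre-focus content word, stress).

    Assumption: a fully crossed design. The pre-focus content word
    contrast applies only to medial and final focus, so it is None
    for initial focus.
    """
    conditions = []
    for qtype, focus, stress in product(
        QUESTION_TYPES, FOCUS_LOCATIONS, STRESS_POSITIONS
    ):
        if focus == "initial":
            pre_focus_options = (None,)
        else:
            pre_focus_options = (True, False)
        for has_pre_focus_word in pre_focus_options:
            conditions.append((qtype, focus, has_pre_focus_word, stress))
    return conditions


conditions = focus_conditions()
print(len(conditions))  # 20 cells under the fully crossed assumption
```

In this grid, Dialogue 1 would fall in a yes-no, medial-focus cell without a pre-focus content word, and Dialogue 2 in the corresponding cell with one.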
As for the research on English commands, English commands can be
divided into different types according to the degree of command mood.
This part was therefore included in the recording in order to give a
complete view of commands. Furthermore, dialogues with commands were
added for better and more natural expression. Moreover, discourses
were added as a supplement in this part so that the scope of prosodic
research can be extended from sentences to whole texts.
Recording
The number of speakers and the recording materials vary according to
the research purpose. Details are listed in Table 5. The first column
of Table 5 shows the recording region, the second column gives the
number of speakers in that region, and the last column lists the
materials recorded, numbered as in Table 4.
Recording Regions  # Speaker   Materials
Guangzhou          30          No. 1 and 2
Xiamen             30          No. 1 and 2
Beijing            26    13    No. 3
                         13    No. 3, 4, 5
USA                26    13    No. 3
                         13    No. 3, 4, 5
Table 5: Speaker information
Annotation:
Data annotation was conducted using Praat. Speech was first labeled by
automatic segmentation software, and then the syllable boundaries were
corrected manually.
As shown in Fig. 2, annotation tiers for English are:
• PH: boundaries of phonemes. “*” indicates mispronunciation, and the
  phoneme in brackets is the standard one. “-” indicates a missing
  phoneme, and the phoneme in front of “-” is the missing one;
• WD: boundaries of each word;
• BI: break index, including boundaries for minor phrase (3) and
  intonation phrase (4);
• ST: stress tier, with 3 and 4 corresponding to different stress
  levels;
• BT: boundary tones.
Fig. 2: Annotation for English uttered by an EFL speaker: “Americans
hear word of massive awards granted to consumers.”
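The marker conventions of the PH tier and the break-index values listed above can be illustrated with a small label reader. This is only a sketch: the paper gives the meaning of “*”, “-” and the brackets but not the exact label syntax, so the label shapes used below (such as `*t(d)` and `d-`), the use of round brackets, and the function names are all assumptions.

```python
import re

# Break-index values named in the text for the English BI tier.
BREAK_INDEX_MEANING = {"3": "minor phrase", "4": "intonation phrase"}


def describe_ph_label(label):
    """Classify one PH-tier label from its markers.

    Assumed label shapes (hypothetical): "*t(d)" for a mispronounced
    phoneme whose standard form is "d", and "d-" for a missing "d".
    """
    if "*" in label:
        match = re.search(r"\(([^)]*)\)", label)
        standard = match.group(1) if match else None
        return {"kind": "mispronounced", "standard": standard}
    if "-" in label:
        # The phoneme in front of "-" is the missing one.
        return {"kind": "missing", "phoneme": label.split("-", 1)[0]}
    return {"kind": "normal", "phoneme": label}


def describe_break_index(value):
    """Map a BI-tier value to its boundary type, if the text names it."""
    return BREAK_INDEX_MEANING.get(value, "not listed in the text")


print(describe_ph_label("*t(d)"))
print(describe_ph_label("d-"))
print(describe_break_index("4"))
```

A reader for the actual annotation files would need the real label syntax; only the marker meanings are taken from the tier descriptions.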
4 Conclusion
This multi-lingual, multi-accent speech corpus consists of Chinese
dialects, Standard Mandarin, accented Mandarin, native English and
Chinese-accented English. It provides a good research basis for
comparative studies of Standard Mandarin vs. accented Mandarin, as
well as native English vs. non-native English, from both segmental and
prosodic perspectives. The research results would benefit automatic
speech recognition for non-native speech, as well as personalized
speech synthesis.
References
Chinese Academy of Social Sciences, Language Atlas of China, Longman
Group (Far East), 1987, 1990.
J. C. Wells, 2006, English Intonation, Cambridge University Press,
London.
J. D. O’Connor and G. F. Arnold, 1961, Intonation of Colloquial
English, Longman Publishers, London.
F. Liu and Y. Xu, 2007, Interaction of word stress, focus, and
sentence type in English, Journal of the Acoustical Society of America
121, Pt. 2, 3199.
Praat, http://www.fon.hum.uva.nl/praat
*Published in Proc. of O-COCOSDA2008, Japan