JEIDA Standard of Symbols for Japanese Text-to-Speech Synthesizers

TANAKA Kazuyo (1), AKABANE Makoto (2), MINOWA Toshimitsu (3), ITAHASHI Shuichi (4)

(1) Electrotechnical Laboratory, 1-1-4 Umezono, Tsukuba, 305-8568, Japan. ktanaka@etl.go.jp
(2) SONY Corporation, 2-15-3 Konan, Minato-ku, Tokyo, 108-6201, Japan. akabane@pdp.crl.sony.co.jp
(3) Matsushita Communication Ind. Co., Ltd., Yokohama, 224-8539, Japan. minowa@adl.mci.mei.co.jp
(4) University of Tsukuba, 1-1-1 Tennodai, Tsukuba, 305-8573, Japan. itahashi@milab.is.tsukuba.ac.jp

ABSTRACT

This paper presents a standard of symbols for Japanese text-to-speech (TTS) synthesizers. The standard was discussed in the Speech Input/Output Systems Expert Committee of the Japan Electronic Industry Development Association (JEIDA) and was announced by JEIDA as the JEIDA Standard "JEIDA-62-2000" in March 2000. We describe its basic policy and outline the symbol system with several examples.

1. INTRODUCTION

Recently, text-to-speech (TTS) synthesizers have rapidly spread to various applications and services, and Japanese manufacturers have put many TTS engines on the market. These TTS engines, however, employ individual (machine-dependent) interfaces and symbol systems, which becomes a serious obstacle when users implement application systems that incorporate any of the TTS engines. To help resolve this problem, the JEIDA Committee began discussing the standardization of the symbols and input format used for TTS engines in 1995 and, in March 2000, issued the first concluding version as the JEIDA Standard "JEIDA-62-2000" [1]. At the same time, the JEIDA Committee also presented guidelines for speech synthesis system performance evaluation as a "JEIDA Guideline" [2].

In the following sections, we describe the outline of the symbol system with several examples and comments. In the "JEIDA-62-2000" version, the JEIDA Committee determined the symbols and description format that every TTS engine needs to accept as its input; that is, a TTS engine synthesizes speech by reading a character sequence written with the standardized symbols and description format. Through the discussion in the Committee, we concluded that the symbols should have the following characteristics: (1) they should not depend on any specific application or platform, such as hardware architecture, operating system, programming language, or character code; and (2) they should be usable in a wide range of applications.

2. CLASSIFICATION OF THE SYMBOLS

First of all, to clarify the characteristics of symbols for TTS synthesizers, we classified them into the categories shown in Table 1, where the rows indicate the level of description and the columns indicate the features conveyed by the symbols. Fig. 1 shows the configuration of each symbol category in relation to a TTS system, together with typical text format examples.

Table 1: Classification of the symbols for Japanese TTS synthesizers

  Level           | Pronunciation feature | Prosodic feature | Control tag
  ----------------+-----------------------+------------------+--------------------
  Text level      | ---                   | ---              | Text-embedded tags
  Kana level      | Pronunciation symbols | Prosodic symbols | ---
  Allophone level | Pronunciation symbols | Prosodic symbols | ---

Fig. 1 illustrates the symbol categories along the TTS processing flow (Input Text -> Japanese Text Analysis -> Allophone-level Analysis -> Synthesizer Engine):

  Text-level representation:
    <PRON SYM="キョ’-">今日</PRON>は天気がよい。
  Kana-level representation:
    (Kana)       キョウ’ワ|_テンキカ゜/ヨ’イ.
    (Alphabetic) kyo’-wa|_te’nnkinga/yo’i.
  Allophone-level representation:
    (SAMPA)      kjo!:wa||te!NkiNa|jo!i

Fig. 1: JEIDA symbols in relation to a TTS system.

3. KANA-LEVEL REPRESENTATION

3.1 Pronunciation symbols

The Kana-level representation is used at the stage where the basic pronunciation of a word sequence is given, based on the result of text analysis. Input Japanese texts are usually written with Chinese characters, Hiragana and Katakana characters, Arabic numerals, etc., so these texts need to be transformed into pronunciation symbol sequences using the text analysis result. The pronunciation symbols represent the phonetic aspect of words apart from their prosodic features, and are uniquely defined for every Japanese monosyllable (see Appendix-I). In Japan, pronunciation is usually written with Kana or alphabetic characters, so we accepted both character sets for the representation symbols. Thus, a monosyllable is written as 1) a single Kana character or character string (e.g. キョー), or 2) an alphabetic character or character string (e.g. kyo-), as indicated in the Kana-level representation shown in Fig. 1. While these character representations are uniquely defined, their coding methods are not specified here, in order to remain independent of application platforms.

The representation method deviates from standard Japanese orthographic transcription in several points. No characters are provided for devocalization of vowels, because devoicing is usually predictable from the phonetic context. On the other hand, nasalization of /g/, which is phonetically distinct but not recognized as a phonemic unit in the Japanese language, is explicitly denoted by a different character, because this type of phonetic alteration is derived from the text analysis result.

3.2 Prosodic symbols

The prosodic symbols are defined for the prosodic features contained in Japanese texts, such as accent position, accent phrase boundary, phrase boundary, sentence end and its intonation type (e.g., normal, interrogative, or exclamatory), and pause insertion (see Appendix-II). They are defined on the assumption that they can represent the Tokyo dialect. We did not define symbols representing finer prosodic phenomena, because such symbols might be troublesome to implement across various applications. Several closer definitions are given in the Allophone-level representation.

4. ALLOPHONE-LEVEL REPRESENTATION

The Allophone-level representation is defined for specifying pronunciation in more detail. It is based on the IPA representation of Japanese syllables. However, since the IPA representation of Japanese itself still includes several ambiguous items, we referred to a report by Oonishi et al. [3] in determining this standard. For easy use in computer programs, we adopted XSAMPA (eXtended Speech Assessment Methods Phonetic Alphabet) [5, 6] as the ASCII expression corresponding to the IPA. Prosodic features are also represented using symbols defined in the suprasegmentals of the IPA. For representing accentuation, we adopted symbols indicating rise and fall positions (see Appendix-III), so that dialects other than the Tokyo dialect can also be represented. Since the IPA is not intended to express general prosodic characteristics, the current set of symbols for prosodic features is not yet sufficient to express details. We consider that the symbols for prosodic features need further investigation in the future to compose a more convenient representation.

5. TEXT-EMBEDDED CONTROL-TAGS

The Text-Embedded Control-Tags are defined to specify, for example, the language, speaking rate, speech volume, or part of speech in the input text, and are written in XML format. We determined them by referring to several documents already announced, such as the Microsoft Speech API (SAPI) [7] and the Aural Style Sheets of the W3C (World Wide Web Consortium) [8]. At the same time, we also considered the characteristics of the Japanese language and of TTS engines for Japanese. Thus, most of the Control-Tags that overlapped with those already defined in SAPI were adopted in the JEIDA Standard, and several tags were added as components specifically used for Japanese TTS synthesizers.

5.1 Examples of the Text-Embedded Control-Tags

The Control-Tags are categorized into those for controlling TTS systems, those for indicating pronunciations, and those for helping text analysis. Examples are shown in the following.

(1) Tag examples for controlling TTS systems:
  BOOKMARK: inserts a bookmark.
  SPEECH: specifies the scope of the tags defined in this standard.
  LANG: specifies a language.
  VOICE: specifies a voice font, such as a specified person's voice tone.
  RESET: resets the values in the section corresponding to SPEECH.

(2) Tag examples for indicating pronunciations:
  SILENCE: inserts a pause.
  EMPH: indicates an emphasized phrase, word, etc.
  SPELL: indicates that a word is to be read out letter by letter, using the alphabetic character pronunciation of its spelling.
  RATE: specifies the speaking rate.
  VOLUME: specifies the speech volume.
  PITCH: specifies the average pitch level.

(3) Tag examples for helping text analysis:
  PARTOFSP: specifies the part of speech of a word in the text.
  CONTEXT: specifies prior information within the scope.
  REGWORD: registers a pronunciation and/or part of speech, etc., for a specified word.

5.2 Attribute Specification

Two ways are acceptable for specifying attributes of the tags, such as speaking rate, pitch level, and speech volume. One is to express an absolute value (or level); the other is a relative value (or level) with respect to the preceding one, as shown in the following:

(1) Using an absolute value (the attribute name is "AbsAttr"):
  <TagName AbsAttr="5">    (the value is set to 5)
    <TagName AbsAttr="6">  (the value is set to 6)
    </TagName>             (the value is reset to 5)
  </TagName>               (the value is reset to the one before "5")

(2) Using a relative value (the attribute name is "RelAttr" and the reference value is 0):
  <TagName RelAttr="+5">   (the value is set to 5)
    <TagName RelAttr="+6"> (the value is set to 11)
    </TagName>             (the value is reset to 5)
  </TagName>               (the value is reset to 0)

6. CONCLUDING REMARKS

The JEIDA Standard of Symbols "JEIDA-62-2000" is the first version that defines the primary part of the symbol system for Japanese TTS synthesizers. It will be revised in the future to keep up with the progress of TTS engines and their application systems.

ACKNOWLEDGMENT

The authors wish to thank all the members of the Speech Input/Output Systems Expert Committee of JEIDA for their intensive effort and discussion in establishing this Standard of Symbols.

REFERENCES

[1] Japan Electronic Industry Development Association (JEIDA) Standard: Standard of symbols for Japanese text-to-speech synthesizers (in Japanese), JEIDA-62-2000, March 2000.
[2] JEIDA: Guidelines for speech synthesis system performance evaluation methods (in Japanese), JEIDA-G-2000, March 2000. See also S. Itahashi, "Guidelines for Japanese speech synthesizer evaluation," Proc. LREC2000, May 2000.
[3] M. Oonishi, S. Toki, M. Dantsuji, "Research report on phonetic symbol representation of Japanese," Proceedings of the 1995 Meeting, Phonetic Association of Japan, pp. 11-17, 1995.
[4] D. Gibbon, R. Moore, R. Winski (eds.), Handbook of Standards and Resources for Spoken Language Systems, Mouton de Gruyter, 1997.
[5] SAMPA: http://www.phon.ucl.ac.uk/home/sampa/home
[6] Microsoft Speech SDK: http://research.microsoft.com/stg/
[7] W3C: http://www.w3.org/TR/WD-CSS2/
[8] VoiceXML: http://www.voicexml.org/

Appendix-I: Kana-level symbols for Japanese monosyllable pronunciation.
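As a brief illustration of how an application might consume these monosyllable symbols (a hypothetical sketch, not part of the standard), the following Python fragment converts a Kana-level string, including the Appendix-II prosodic symbols, into its alphabetic form by longest-match lookup. Only a handful of entries appear here; in practice the dictionary would be populated from the full table below.

```python
# Hypothetical sketch, not part of the JEIDA standard: convert a Kana-level
# pronunciation string to its alphabetic form by longest-match lookup.
# Only a few entries from Appendix-I are included for illustration.
KANA_TO_ALPHA = {
    "キョ": "kyo", "ウ": "u", "ワ": "wa", "テ": "te", "ン": "nn",
    "キ": "ki", "カ゜": "nga", "ヨ": "yo", "イ": "i", "ー": "-",
}
# Prosodic symbols (Appendix-II) are passed through unchanged.
PROSODY = set("'/|.?!_’")

def kana_to_alpha(text: str) -> str:
    out, i = [], 0
    keys = sorted(KANA_TO_ALPHA, key=len, reverse=True)  # longest match first
    while i < len(text):
        if text[i] in PROSODY:          # prosodic symbol: copy verbatim
            out.append(text[i])
            i += 1
            continue
        for k in keys:                  # try multi-character syllables first
            if text.startswith(k, i):
                out.append(KANA_TO_ALPHA[k])
                i += len(k)
                break
        else:
            raise ValueError(f"undefined symbol at position {i}: {text[i]!r}")
    return "".join(out)

print(kana_to_alpha("キョウ’ワ|_テンキカ゜/ヨ’イ."))
# → kyou’wa|_tennkinga/yo’i.
```

Because every monosyllable symbol in the table is uniquely defined, a longest-match scan is unambiguous; the same table, inverted, would drive the alphabetic-to-Kana direction.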
Table 2-1: Pronunciation symbols (読み記号)

  ア a    イ i    ウ u    エ e    オ o
  カ ka   キ ki   ク ku   ケ ke   コ ko
  サ sa   シ shi  ス su   セ se   ソ so
  タ ta   チ chi  ツ tsu  テ te   ト to
  ナ na   ニ ni   ヌ nu   ネ ne   ノ no
  ハ ha   ヒ hi   フ hu   ヘ he   ホ ho
  マ ma   ミ mi   ム mu   メ me   モ mo
  ラ ra   リ ri   ル ru   レ re   ロ ro
  ガ ga   ギ gi   グ gu   ゲ ge   ゴ go
  ザ za   ジ ji   ズ zu   ゼ ze   ゾ zo
  ダ da   ディ di  ドゥ du  デ de   ド do
  バ ba   ビ bi   ブ bu   ベ be   ボ bo
  パ pa   ピ pi   プ pu   ペ pe   ポ po
          ティ ti  トゥ tu
  ヴァ va  ヴィ vi  ヴ vu   ヴェ ve  ヴォ vo
  カ゜ nga キ゜ ngi ク゜ ngu ケ゜ nge コ゜ ngo
  スィ si  ズィ zi
  ン nn   ッ q    ー -

  ヤ ya     ユ yu     イェ ye     ヨ yo
  キャ kya  キュ kyu  キェ kye    キョ kyo
  シャ sha  シュ shu  シェ she    ショ sho
  チャ cha  チュ chu  チェ che    チョ cho
  ニャ nya  ニュ nyu  ニェ nye    ニョ nyo
  ヒャ hya  ヒュ hyu  ヒェ hye    ヒョ hyo
  ミャ mya  ミュ myu  ミェ mye    ミョ myo
  リャ rya  リュ ryu  リェ rye    リョ ryo
  ギャ gya  ギュ gyu  ギェ gye    ギョ gyo
  ジャ ja   ジュ ju   ジェ je     ジョ jo
  デャ dya  デュ dyu  ディェ dye   デョ dyo
  ビャ bya  ビュ byu  ビェ bye    ビョ byo
  ピャ pya  ピュ pyu  ピェ pye    ピョ pyo
  テャ tya  テュ tyu  ティェ tye   テョ tyo
  ヴャ vya  ヴュ vyu  ヴィェ vye   ヴョ vyo
  キ゜ャ ngya キ゜ュ ngyu キ゜ェ ngye キ゜ョ ngyo
  フャ fya  フュ fyu  フィェ fye   フョ fyo

  ワ wa     ウィ wi     ウェ we     ウォ wo
  クァ kwa  クィ kwi    クェ kwe    クォ kwo
  スァ swa  スゥィ swi   スェ swe    スォ swo
  ツァ tsa  ツィ tsi    ツェ tse    ツォ tso
  ヌァ nwa  ヌィ nwi    ヌェ nwe    ヌォ nwo
  ファ fa   フィ fi     フェ fe     フォ fo
  ムァ mwa  ムィ mwi    ムェ mwe    ムォ mwo
  ルァ rwa  ルィ rwi    ルェ rwe    ルォ rwo
  グァ gwa  グィ gwi    グェ gwe    グォ gwo
  ズァ zwa  ズィ zwi    ズェ zwe    ズォ zwo
  ドゥァ dwa ドゥィ dwi  ドゥェ dwe  ドゥォ dwo
  ブァ bwa  ブィ bwi    ブェ bwe    ブォ bwo
  プァ pwa  プィ pwi    プェ pwe    プォ pwo
  トゥァ twa トゥィ twi  トゥェ twe  トゥォ two
  ヴゥァ vwa ヴゥィ vwi  ヴゥェ vwe  ヴゥォ vwo
  ク゜ァ ngwa ク゜ィ ngwi ク゜ェ ngwe ク゜ォ ngwo

Appendix-II: Kana-level symbols for prosodic features.

  Function                                      | Symbol | ASCII code
  ----------------------------------------------+--------+-----------
  Accent position                               | '      | 039
  Accent phrase boundary                        | /      | 047
  Phrase boundary                               | |      | 124
  End of sentence, normal/assertive intonation  | .      | 046
  End of sentence, interrogative intonation     | ?      | 063
  End of sentence, exclamatory intonation       | !      | 033
  1-mora pause insertion                        | _      | 095

(The ASCII codes are given in decimal.)

Appendix-III: Allophone-level symbols for prosodic features.

  Function                                    | XSAMPA | ASCII code
  --------------------------------------------+--------+-------------
  Accent, upstep                              | ^      | 094
  Accent, downstep                            | !      | 033
  Accent phrase boundary, minor (foot) group  | |      | 124
  Phrase boundary, major (intonation) group   | ||     | 124 124
  End of sentence, global fall                | <F>    | 060 070 062
  End of sentence, global rise                | <R>    | 060 082 062
  1-mora pause insertion                      | ...    | 046 046 046

(The IPA symbols corresponding to the XSAMPA symbols are omitted here.)
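The attribute scoping described in Section 5.2 amounts to a stack discipline: an opening tag pushes a new value, either absolute or relative to the current one, and a closing tag restores the previous value. The following Python sketch (hypothetical helper names, not an API defined by the standard) reproduces both numbered examples from that section.

```python
# Hypothetical sketch of the attribute scoping rules of Sec. 5.2:
# an opening tag pushes a value (absolute, or relative to the current
# one); a closing tag pops back to the previously effective value.
class AttrScope:
    def __init__(self, reference: int = 0):
        self.stack = [reference]  # reference value (e.g. 0 for RelAttr)

    def open_abs(self, value: int) -> None:
        """<TagName AbsAttr="value">: the value is set absolutely."""
        self.stack.append(value)

    def open_rel(self, delta: int) -> None:
        """<TagName RelAttr="+delta">: the value is set relative to the current one."""
        self.stack.append(self.stack[-1] + delta)

    def close(self) -> None:
        """</TagName>: the previous value becomes effective again."""
        self.stack.pop()

    @property
    def current(self) -> int:
        return self.stack[-1]

# Example (2) of Sec. 5.2, with reference value 0:
scope = AttrScope(reference=0)
scope.open_rel(+5)    # <TagName RelAttr="+5">  -> value is 5
scope.open_rel(+6)    # <TagName RelAttr="+6">  -> value is 11
print(scope.current)  # prints 11
scope.close()         # </TagName>              -> back to 5
scope.close()         # </TagName>              -> back to 0
print(scope.current)  # prints 0
```

The same stack serves both attribute styles, which is why the standard can allow absolute and relative specifications to nest freely within one SPEECH scope.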