Word-level Dependency-structure Annotation to Corpus of Spontaneous Japanese and Its Application

Kiyotaka Uchimoto* Yasuharu Den†
* National Institute of Information and Communications Technology (NICT)
† Chiba University

Outline
  Background
  Dependency Structure in the CSJ
  Dependency-structure Annotation
  Word-level Dependency-structure Analysis
  Towards Construction of Middle Words
  Summary and future work

Background (1)
  Corpus of Spontaneous Japanese (CSJ) [Maekawa et al., 2000]
    The largest spontaneous-speech corpus in the world
    Includes transcriptions of speeches as well as audio recordings
    One tenth of the CSJ has been manually annotated with
      • Morphemes, sentence boundaries, syntactic structures, discourse structures, prosodic information, etc.

Background (2)
  Syntactic structure of a sentence
    Represented by dependency relationships between bunsetsus
    As represented in the Kyoto University text corpus
    Syntactic structure of a bunsetsu is not considered
  Example (two bunsetsus)
    nihon gata kokusai kouken ga (Japanese style international contribution)
    motome rare te iru (is required)

Dependency Structure in the CSJ (1)
  Dependency relationships between bunsetsus
    Annotated within "sentences" in the CSJ
  Dependency relationships between words
    Annotated within bunsetsus
    Word segments in the word-level dependency structure: short words
      • A short word approximates a term found in an ordinary dictionary
      • A long word represents various compounds
  Example (two bunsetsus)
    nihon gata kokusai kouken ga (Japanese style international contribution)
    motome rare te iru (is required)

Dependency Structure in the CSJ (2)
  Disfluencies characteristic of spontaneous speech
    Self-correction
      • Represented as a dependency between bunsetsus, to which label D is assigned
  Example (one bunsetsu per line)
    Yamada (Yamada)   D
    Yamada san wa (Mr. Yamada)
    kyoujin na (strong)
    nikutai no (body)
    mochinushi da to (possessor)
    it te mashi ta ne (said)
    (Yamada, Mr. Yamada said that he had a strong body.)
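The bunsetsu-level example above can be written down as a small data structure. The following Python sketch is illustrative only and is not the CSJ file format: the class name `Bunsetsu`, the field names, and the helper `self_corrections` are hypothetical. Only the D-labelled dependency from "Yamada" to "Yamada san wa" is stated on the slide; the other head indices are one plausible reading of the translation and are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Bunsetsu:
    text: str
    head: Optional[int]          # index of the modifiee bunsetsu; None for the last one
    label: Optional[str] = None  # "D" marks a self-correction


# Heads other than the D-labelled one are assumed for illustration.
sentence = [
    Bunsetsu("Yamada", 1, "D"),
    Bunsetsu("Yamada san wa", 5),
    Bunsetsu("kyoujin na", 3),
    Bunsetsu("nikutai no", 4),
    Bunsetsu("mochinushi da to", 5),
    Bunsetsu("it te mashi ta ne", None),
]


def self_corrections(bunsetsus: List[Bunsetsu]) -> List[Tuple[str, str]]:
    """Return (corrected text, correcting text) for every D-labelled dependency."""
    return [
        (b.text, bunsetsus[b.head].text)
        for b in bunsetsus
        if b.label == "D" and b.head is not None
    ]


print(self_corrections(sentence))  # [('Yamada', 'Yamada san wa')]
```

The same layout would carry over to the word level by storing one such list per bunsetsu, with words in place of bunsetsus.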

Dependency Structure in the CSJ (3)
  Disfluencies characteristic of spontaneous speech
    Self-correction
      • Represented as a dependency between words, to which label D is assigned
  Example (one bunsetsu, word by word)
    kokuritsu   Nihon       go      D  kokugo               kenkyuu     jo           de
    (national)  (Japanese)  (word)     (Japanese language)  (research)  (institute)  (case marker)
    (At National Japanese word, Japanese language research institute)

Dependency-structure Annotation
  Manual annotation
    199 speeches for dependency relationships between bunsetsus
    50 speeches for dependency relationships between words
  Human annotation by using a tool
    Initial: every bunsetsu depends on the next
    Step 1: two annotators examined each dependency and modified it if it was inappropriate
    Step 2: a checker examined all dependencies
      • Referred to audio recordings as well as transcriptions
  Annotation tool, bunsetsu level
    Each line represents a bunsetsu
    Modified by mouse drag-and-drop
    Self-corrections, coordination, and appositives can be annotated with labels D, P, and A by right-clicking the mouse
  Annotation tool, word level
    Each line represents a word
    Modified by mouse drag-and-drop

Word-level Dependency-structure Analysis (1)
  Finding a modifiee for each word in a bunsetsu
    Each dependency goes from left to right
    The rightmost word is assumed to have no modifiee
  Existing methods were applied
    Ex. shift-reduce method [Nivre and Scholz, 2004]
      nihon/noun gata/suffix kokusai/noun kouken/noun ga/ppp
      [Figure: parser state showing a stack (nihon gata), the word kokusai, and the input words (kouken ga …)]

Word-level Dependency-structure Analysis (2)
  Experiments
    50 speeches in the CSJ
      • Word-level dependencies (total: 33,429)
        – The rightmost dependency in each bunsetsu was not counted
    10-fold cross validation
    Features: words and their POS categories

    Method                                 Dependency accuracy
    Baseline                               98.6%
    Shift-reduce (Nivre & Scholz, 2004)    99.1%
    MST parser (McDonald et al., 2005)     99.1%
    CaboCha (Kudo and Matsumoto, 2000)     99.1%

Application of Word-level Dependency-structure
  In text-to-speech synthesis
    A basic unit is required to indicate appropriate pronunciation and accent
    "rendaku" (Weijer et al., 2005)

    Long word            Middle words           Short words
    dandanba^take        dandanbatake           da^ndan (layered), hatake (fields)
    gairaigokanahyouki   gairaigo, kanahyouki   gairai (foreign), go (word), kana (kana), hyouki (orthography)
    manyogana            manyogana              manyo (myriad), kana (kana)

Application of Word-level Dependency-structure
  A sound change or an accent change is blocked by right-branching tree structures (Kubozono, 1995)
    Examples: the long, middle, and short words listed on the previous slide

Construction of Middle Words
  Construction rule
    Combine adjacent short words that have dependency relationships, under the condition that a middle word is not longer than a long word
  Morphological information
    If a middle word corresponds to a long word
      • Extracted from the long word
    Otherwise
      • Extracted from the rightmost short word in the middle word
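
The construction rule above can be sketched in Python. This is one possible reading of the rule, not the authors' implementation. It assumes that two adjacent short words "have a dependency relationship" when the left one directly modifies the right one, and it works inside a single long word so that no middle word can exceed it. The names `ShortWord` and `build_middle_words`, the head indices, and the long-word POS "Noun" are hypothetical, since the slide does not preserve the arcs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ShortWord:
    surface: str
    pos: str
    head: Optional[int]  # index of the modifiee inside the long word; None for the rightmost


def build_middle_words(short_words: List[ShortWord],
                       long_word_pos: str) -> List[Tuple[str, str]]:
    """Group the short words of one long word into (surface, POS) middle words."""
    if not short_words:
        return []
    # Merge word i into the current group only if word i-1 directly modifies it.
    groups: List[List[int]] = [[0]]
    for i in range(1, len(short_words)):
        if short_words[i - 1].head == i:
            groups[-1].append(i)
        else:
            groups.append([i])
    middle_words = []
    for group in groups:
        surface = "".join(short_words[i].surface for i in group)
        if len(groups) == 1:
            pos = long_word_pos                 # middle word corresponds to the long word
        else:
            pos = short_words[group[-1]].pos    # otherwise: rightmost short word
        middle_words.append((surface, pos))
    return middle_words


# Assumed arcs: kihon -> suu, shuuha -> suu, suu -> pataan.
example = [
    ShortWord("kihon", "Noun", 2),
    ShortWord("shuuha", "Noun", 2),
    ShortWord("suu", "Suffix", 3),
    ShortWord("pataan", "Noun", None),
]
print(build_middle_words(example, "Noun"))
# [('kihon', 'Noun'), ('shuuhasuupataan', 'Noun')]
```

Under these assumed arcs the sketch yields the split kihon | shuuha suu pataan with POS Noun and Noun.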
  Example
    Long word:     kihon/shuuha/suu/pataan (basic frequency pattern)
                   Noun  Noun  Suffix  Noun
    Middle words:  kihon | shuuha suu pataan
                   Noun    Noun

Middle Words and Accent Phrases
  Relationships between middle words (MW) and accent phrases (BI=2, 2+p, 2+b, 2+bp, 3) in the CSJ

    Long words (LW)                                                  97,167
      No accent phrase boundary (APB) in LW                          94,038
        LW = MW                                                      93,797
        LW > MW                                                         241
      Accent phrase boundary (APB) in LW                              3,129
        APB in MW                                                     2,942
        No APB in MW                                                    187
        MW boundary corresponds to LW boundary or APB                 3,075
        MW boundary corresponds neither to LW boundary nor to APB        54

  Examples shown on the slide
    nihonjin/gakushuusha, rittai/chuushajou
    kaku|zokusei, gen|jiten, zen|shikiichi
    should be reduced
    emuten|chuuouchi/heikatsuka, yuudo/saidaika|kijun

Summary and Future Work
  Dependency structure of a large spontaneous Japanese speech corpus, the Corpus of Spontaneous Japanese (CSJ)
  Application of a word-level dependency structure
    Constructing new basic units, middle words
    Middle words: useful as constituents of accent phrases
  Annotation of the Balanced Corpus of Contemporary Written Japanese (BCCWJ)

Supported by the priority area program 'Japanese Corpus', a five-year (2006-2010) project
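
The counts reported for middle words and accent phrases can be checked for internal consistency with a few lines of Python. The numbers are taken from the slide. The pairing of each label with its count follows the order in which they appear and is an assumption, as are the dictionary keys and the helper `share`.

```python
# Counts from the "Middle Words and Accent Phrases" slide; key names are hypothetical.
counts = {
    "long_words": 97167,
    "no_apb_in_lw": 94038,
    "apb_in_lw": 3129,
    "lw_eq_mw": 93797,
    "lw_gt_mw": 241,
    "apb_in_mw": 2942,
    "no_apb_in_mw": 187,
    "mw_boundary_matches": 3075,
    "mw_boundary_matches_neither": 54,
}


def share(part: int, whole: int) -> float:
    """Percentage of `whole` accounted for by `part`."""
    return 100.0 * part / whole


# Each group of sub-counts should add up to its parent count.
consistent = (
    counts["no_apb_in_lw"] + counts["apb_in_lw"] == counts["long_words"]
    and counts["lw_eq_mw"] + counts["lw_gt_mw"] == counts["no_apb_in_lw"]
    and counts["apb_in_mw"] + counts["no_apb_in_mw"] == counts["apb_in_lw"]
    and counts["mw_boundary_matches"] + counts["mw_boundary_matches_neither"]
    == counts["apb_in_lw"]
)
print("sub-counts add up:", consistent)
print("LW without APB: %.1f%%" % share(counts["no_apb_in_lw"], counts["long_words"]))
print("LW with APB:    %.1f%%" % share(counts["apb_in_lw"], counts["long_words"]))
```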