Word-level Dependency-structure
Annotation to Corpus of Spontaneous
Japanese and Its Application
Kiyotaka Uchimoto*
Yasuharu Den†
*National Institute of Information and
Communications Technology (NICT)
† Chiba University
Outline
Background
Dependency Structure in the CSJ
Dependency-structure Annotation
Word-level Dependency-structure Analysis
Towards Construction of Middle Words
Summary and future work
Background (1)
Corpus of Spontaneous Japanese (CSJ)
[Maekawa et al., 2000]



The largest spontaneous-speech corpus in the
world
Includes transcriptions of speeches as well as
audio recordings
One tenth of the CSJ has been manually
annotated with
• Morphemes, sentence boundaries, syntactic structures,
discourse structures, prosodic information, etc.
Background (2)
Syntactic structure of a sentence


Represented by dependency relationships
between bunsetsus
As represented in the Kyoto University text
corpus
The internal syntactic structure of a bunsetsu is not
considered
nihon gata kokusai kouken ga
(Japanese style international contribution)
motome rare te iru (is required)
Dependency Structure in the CSJ (1)
Dependency relationships between bunsetsus

Annotated within “sentences” in the CSJ
Dependency relationships between words


Annotated within bunsetsus
Word segments in the word-level dependency structure:
short words
• A short word approximates a term found in an ordinary
dictionary
• A long word represents various compounds
nihon gata kokusai kouken ga
(Japanese style international contribution)
motome rare te iru (is required)
Dependency Structure in the CSJ (2)
Disfluencies characteristic of spontaneous speech
Self-correction

• Represented as a dependency between bunsetsus, to which label D is
assigned
Yamada (Yamada) [label D]
Yamada san wa (Mr. Yamada)
kyoujin na (strong)
nikutai no (body)
mochinushi da to (possessor)
it te mashi ta ne (said)
(Yamada, Mr. Yamada said that he had a strong body.)
Dependency Structure in the CSJ (3)
Disfluencies characteristic of spontaneous speech

Self-correction
• Represented as a dependency between words, to which label D is
assigned
kokuritsu (national)
Nihon (Japanese)
go (word) [label D]
kokugo (Japanese language)
kenkyuu (research)
jo (institute)
de (case marker)
(At National Japanese word, Japanese language research institute)
Dependency-structure Annotation
Manual annotation


199 speeches for dependency relationships between
bunsetsus
50 speeches for dependency relationships between
words
Human annotation using a tool



Initial: every bunsetsu depends on the next
Step 1: two annotators examined each dependency and
modified it if it was inappropriate
Step 2: a checker examined all dependencies
• Referred to audio recordings as well as transcriptions
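[Screenshot: annotation tool, bunsetsu level]
Each line represents a bunsetsu
Dependencies are modified by mouse drag-and-drop
Self-corrections, coordination, and appositives can be annotated with
labels D, P, and A by right-clicking the mouse
[Screenshot: annotation tool, word level]
Each line represents a word
Dependencies are modified by mouse drag-and-drop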
Word-level Dependency-structure
Analysis (1)
Finding a modifiee for each word in a bunsetsu


Each dependency goes from left to right
The rightmost word is assumed to have no modifiee
Existing methods were applied

Ex. shift-reduce method [Nivre and Scholz, 2004]
nihon/noun gata/suffix kokusai/noun kouken/noun ga/ppp
[Figure: parser state]
Stack: nihon gata, kokusai
Input words: kouken ga …
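The two constraints above (every dependency goes from left to right, and the rightmost word has no modifiee) can be illustrated with a small stack-based sketch. This is a simplified illustration, not the Nivre and Scholz (2004) parser itself: the trained classifier is replaced by a hypothetical `depends` callback, and the head indices in `assumed_heads` are an assumption made only for this example.

```python
from typing import Callable, List, Optional


def parse_bunsetsu(
    words: List[str],
    depends: Callable[[int, int], bool],
) -> List[Optional[int]]:
    """Return, for each word, the index of its modifiee (None for the rightmost).

    `depends(i, j)` is a hypothetical decision function that says whether
    word i modifies word j (i < j); a real system would use a trained model.
    """
    heads: List[Optional[int]] = [None] * len(words)
    stack: List[int] = []
    for j in range(len(words)):
        # Reduce: attach words on top of the stack that modify the incoming word.
        while stack and depends(stack[-1], j):
            heads[stack.pop()] = j
        # Shift: the incoming word now waits for its own modifiee.
        stack.append(j)
    # Fallback (an assumption of this sketch): any word still waiting
    # is attached to the rightmost word, which itself has no modifiee.
    last = len(words) - 1
    for i in stack:
        if i != last:
            heads[i] = last
    return heads


words = ["nihon", "gata", "kokusai", "kouken", "ga"]
# Assumed heads, for illustration only: nihon->gata, gata->kouken,
# kokusai->kouken, kouken->ga.
assumed_heads = [1, 3, 3, 4, None]
heads = parse_bunsetsu(words, lambda i, j: assumed_heads[i] == j)
```

With this oracle, just before kouken is read the stack holds gata (with nihon already attached) and kokusai, and the remaining input is kouken ga.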
Word-level Dependency-structure
Analysis (2)
Experiments

50 speeches in the CSJ
• Word-level dependencies (total: 33,429)
– The rightmost dependency in each bunsetsu was not counted

10-fold cross validation
Features: words and their POS categories

Method                                 Dependency accuracy
Baseline                               98.6%
Shift-reduce (Nivre & Scholz, 2004)    99.1%
MST parser (McDonald et al., 2005)     99.1%
CaboCha (Kudo and Matsumoto, 2000)     99.1%
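As a rough sketch of how a dependency accuracy of this kind could be computed: the fraction of counted words whose predicted modifiee matches the gold modifiee. The slide does not spell out exactly which positions were excluded, so the sketch assumes that the final word of each bunsetsu (which has no modifiee) is skipped and exposes that choice as the hypothetical parameter `skip_last`.

```python
from typing import List, Optional

Heads = List[Optional[int]]  # modifiee index per word; None for the rightmost


def dependency_accuracy(
    gold: List[Heads],
    predicted: List[Heads],
    skip_last: int = 1,
) -> float:
    """Fraction of counted words whose predicted modifiee equals the gold one.

    `gold` and `predicted` hold one head list per bunsetsu.
    `skip_last` trailing words of each bunsetsu are not counted
    (an assumption of this sketch, not a detail taken from the slides).
    """
    correct = 0
    total = 0
    for gold_heads, pred_heads in zip(gold, predicted):
        counted = max(len(gold_heads) - skip_last, 0)
        for g, p in zip(gold_heads[:counted], pred_heads[:counted]):
            total += 1
            if g == p:
                correct += 1
    return correct / total if total else 0.0


# Illustrative data only: one five-word bunsetsu with one wrong prediction.
example_gold = [[1, 3, 3, 4, None]]
example_pred = [[1, 2, 3, 4, None]]
example_accuracy = dependency_accuracy(example_gold, example_pred)
```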
Application of Word-level
Dependency-structure
In text-to-speech synthesis

A basic unit is required that indicates appropriate
pronunciation and accent
[Figure: long-word, middle-word, and short-word segmentations of dandanbatake, gairaigokanahyouki, and manyogana]
“rendaku” (Weijer et al., 2005)
Application of Word-level
Dependency-structure
A sound change or an accent change is
blocked by right-branching tree structures
(Kubozono, 1995)
Long word:    dandanba^take | gairaigokanahyouki | manyogana
Middle word:  dandanbatake | gairaigo, kanahyouki | manyogana
Short word:   da^ndan, hatake | gairai, go, kana, hyouki | manyo, kana
Gloss:        (layered), (fields) | (foreign), (word), (kana), (orthography) | (myriad), (kana)
Construction of Middle Words
Construction rule
• Combine adjacent short words that have dependency
relationships, under the condition that a middle word is not longer
than a long word
Morphological information
• If a middle word corresponds to a long word:
extracted from the long word
• Otherwise:
extracted from the rightmost short word in the middle word
Example
Short words: kihon/shuuha/suu/pataan (Noun / Noun / Suffix / Noun)
(basic frequency pattern)
Middle words: kihon | shuuha suu pataan (Noun | Noun)
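The construction rule and the morphological-information rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ShortWord` type and the function name are hypothetical, "have dependency relationships" is read as "the left word's modifiee is the immediately following word", and the head indices for the example are assumptions chosen to be consistent with the segmentation kihon | shuuha suu pataan. Processing one long word at a time keeps every middle word no longer than its long word.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ShortWord:
    surface: str
    pos: str
    head: Optional[int]  # index of the modifiee within the long word; None for the rightmost


def build_middle_words(
    short_words: List[ShortWord],
    long_word_pos: str,
) -> List[Tuple[str, str]]:
    """Group the short words of ONE long word into middle words.

    Adjacent short words are merged when the left one modifies the right one.
    Returns (surface, POS) pairs.
    """
    if not short_words:
        return []
    groups: List[List[int]] = [[0]]
    for i in range(1, len(short_words)):
        if short_words[i - 1].head == i:
            groups[-1].append(i)  # adjacent and dependent: same middle word
        else:
            groups.append([i])    # otherwise start a new middle word
    middle_words = []
    for group in groups:
        surface = "".join(short_words[i].surface for i in group)
        if len(groups) == 1:
            pos = long_word_pos               # middle word == long word
        else:
            pos = short_words[group[-1]].pos  # rightmost short word
        middle_words.append((surface, pos))
    return middle_words


# Assumed heads, for illustration: kihon->pataan, shuuha->suu, suu->pataan.
example = [
    ShortWord("kihon", "Noun", 3),
    ShortWord("shuuha", "Noun", 2),
    ShortWord("suu", "Suffix", 3),
    ShortWord("pataan", "Noun", None),
]
middle = build_middle_words(example, long_word_pos="Noun")
```

Under these assumed heads, `middle` contains two middle words, kihon and shuuhasuupataan, both tagged Noun, matching the example above.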
Middle Words and Accent Phrases

Relationships between middle words and accent phrases (BI=2, 2+p, 2+b,
2+bp, 3) in the CSJ
Long words (LW): 97,167
• No accent phrase boundary (APB) in LW: 94,038
• Accent phrase boundary (APB) in LW: 3,129
Breakdown categories, in figure order: LW = MW; LW > MW; APB in MW; No APB in MW; MW boundary corresponds to LW boundary or APB; MW boundary corresponds neither to LW boundary nor to APB
Breakdown counts, in figure order: 93,797; 241; 2,942; 187; 3,075; 54
nihonjin/gakushuusha
rittai/chuushajou
kaku|zokusei
gen|jiten
zen|shikiichi
should be reduced
emuten|chuuouchi/heikatsuka
yuudo/saidaika|kijun
Summary and Future Work
Dependency structure of a large, spontaneous,
Japanese-speech corpus, Corpus of Spontaneous
Japanese (CSJ)
Application of a word-level dependency-structure


Constructing new basic units, middle words
Middle words: useful as constituents of accent phrases
Annotation to the Balanced Corpus of
Contemporary Written Japanese (BCCWJ)

Supported by the priority area program ‘Japanese
Corpus’, a five-year (2006-2010) project