Automatic Labeling of Intonation Using Acoustic and Lexical Features
Agnieszka Wagner
Department of Phonetics, Institute of Linguistics, Adam Mickiewicz University in Poznań, Poland
wagner@amu.edu.pl
Motivation

growing interest in speech technologies such as speech synthesis and speech
recognition, for which speech corpora annotated at the segmental and
suprasegmental levels are required

suprasegmental annotation provides information on the utterance's intonation,
which is significant for the output quality of speech applications (e.g.,
Gallwitz et al. 2002, Shriberg et al. 2005, Noeth et al. 2002)

advantages of automatic intonation labeling: significantly less laborious and
time-consuming than manual transcription of intonation, while yielding
comparable accuracy and consistency (Pitrelli et al. 1994, Grice et al. 1996,
Yoon et al. 2004)
The goal

to propose a framework for efficient and effective automatic labeling of
intonation, involving detection and classification of pitch accents and phrase
boundaries with accuracy similar to that of state-of-the-art approaches and
comparable to the levels of agreement among human labelers in manual
transcription of intonation

Determination of accentuation and phrasing

Pitch accents (accented vs. unaccented)
Input features: slope, relative syllable & nucleus duration, Tilt, F0max
Models: classification tree, neural networks (MLP, RBF), DFA

Results of pitch accent detection (test sample); numbers in brackets (column class) show chance-level accuracies.

class            MLP      RBF      d. tree   DFA
+acc (61.18%)    81.79    81.76    77.06     74.89
-acc (38.82%)    81.65    82.14    81.2      80.94
Average (%)      81.72    81.95    79.13     77.92

Phrase boundaries (boundary vs. no boundary)
Input features: slope, relative syllable & nucleus duration, Tilt, F0mean, amplitude (rise)
Models: classification tree, neural networks (MLP, RBF), DFA

Results of phrase boundary detection (test sample).

class           MLP      RBF      d. tree   DFA
-b (72.59%)     81.99    79.29    84.33     90.05
+b (27.14%)     76.26    81.55    78.6      74.04
Average (%)     79.13    80.42    81.47     82.05
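The Tilt and Tilt-amplitude features listed above presumably correspond to Taylor's Tilt intonation model, in which an event is parameterized by the amplitude and duration of its F0 rise and fall. A minimal sketch of that computation (the function and variable names, and the example values, are illustrative only and not taken from the poster):

```python
def tilt_parameters(rise_amp, fall_amp, rise_dur, fall_dur):
    """Tilt parameters of one intonational event (after Taylor's Tilt model).

    rise_amp, fall_amp: magnitude of the F0 rise and fall (Hz or semitones),
    rise_dur, fall_dur: their durations in seconds; all values are >= 0.
    """
    amp_sum = abs(rise_amp) + abs(fall_amp)
    dur_sum = rise_dur + fall_dur
    # amplitude and duration tilt: -1 = pure fall, +1 = pure rise
    tilt_amp = (abs(rise_amp) - abs(fall_amp)) / amp_sum if amp_sum else 0.0
    tilt_dur = (rise_dur - fall_dur) / dur_sum if dur_sum else 0.0
    return {
        "tilt": 0.5 * (tilt_amp + tilt_dur),  # single shape parameter
        "tilt_amplitude": amp_sum,            # overall size of the F0 excursion
        "tilt_duration": dur_sum,             # overall duration of the event
    }

# e.g. a rise-fall accent dominated by the rising part
print(tilt_parameters(rise_amp=40.0, fall_amp=15.0, rise_dur=0.12, fall_dur=0.08))
```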

State of the art

Input features: acoustic, syntactic, lexical (one type as in Wightman & Ostendorf
1994 vs. different combinations vs. all, for example in Ananthakrishnan &
Narayanan 2008). The features can be considered the main cues signaling
pitch accents and phrase boundaries and are expected to discriminate well
between different categories of intonational events.
Statistical modeling techniques: neural networks (Ananthakrishnan & Narayanan
2008), classification trees (Bulyko & Ostendorf 2001), discriminant function
analysis (Demenko 1999), maximum entropy models (Sridhar et al. 2007) or
HMMs (Boidin & Boeffard 2008). Models’ complexity depends on the speaking
style, number of tasks to be performed and number of categories to be
recognized.
Performance: the overall accuracy varies with the task.

task                                                    automatic         manual
pitch accent detection                                  70% - 87%         80% - 90%
phrase boundary detection                               84% - 94%         90% and higher
pitch accent type classification                        about 80%         64% - 95%
boundary type recognition                               90% and higher    80% - 90%
joint pitch accent and boundary tone type recognition   -                 -
Features of the current approach
Input features: easily extracted from the utterance's F0, timing cues and its
labeling; most of them refer to relative values (see the sketch at the end of this section)
- acoustic features: only F0 and duration; correlates of accentual
prominence and phrase boundaries, describe realization of different
categories of pitch accents and phrase boundaries
- lexical features: vowel & syllable type, lexical stress, syllable & word
boundary, distance to the next pause
Statistical modeling techniques: neural networks (NN: RBF, MLP), classification
trees (Quest, Loh & Shih 1997) and discriminant function analysis (DFA)
Automatic labeling involves two detection and two classification tasks:
- detection of 1) accentual prominence and 2) phrase boundaries
- classification of 1) pitch accents (5 categories) and 2) boundary tones
(4 classes) distinguished at the surface-phonological level

pitch accents and phrase boundaries are detected at the word level

small feature vectors constitute a compact representation of intonation (from
5 to 8 features determined at the vowel or syllable level); the representation
is characterized by low redundancy and wide coverage

four statistical models are designed each to perform one of the detection or
classification tasks

speech material: subset of sentences from the speech corpus used in the
Polish module of BOSS (Bonn Open Synthesis System, Breuer et al. 2000),
labeled at the segmental and suprasegmental level (manually marked
position and category of pitch accents and phrase boundaries)

F0 extraction, preprocessing and parameterization carried out in Praat
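A minimal sketch of how such a compact, largely relative feature vector could be assembled for one syllable after F0 extraction; the parselmouth interface to Praat, the function name and the normalization choices are assumptions made for illustration, not the actual pipeline used in the study:

```python
import numpy as np
import parselmouth  # Python interface to Praat (assumed here; any F0 tracker would do)

def syllable_features(wav_path, syl_start, syl_end, nucleus_start, nucleus_end,
                      utt_mean_f0, utt_mean_syl_dur):
    """Toy acoustic feature vector for one syllable (names and choices illustrative)."""
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch(time_step=0.01)
    t = pitch.xs()
    f0 = pitch.selected_array["frequency"]          # 0 for unvoiced frames
    in_syl = (f0 > 0) & (t >= syl_start) & (t <= syl_end)
    f0_syl, t_syl = f0[in_syl], t[in_syl]
    if len(f0_syl) < 2:
        return None                                  # no usable F0 in this syllable
    return {
        "f0_max_rel": f0_syl.max() / utt_mean_f0,    # relative to utterance mean F0
        "f0_mean_rel": f0_syl.mean() / utt_mean_f0,
        "f0_min_rel": f0_syl.min() / utt_mean_f0,
        "slope": np.polyfit(t_syl, f0_syl, 1)[0],    # overall F0 slope in Hz/s
        "rel_syl_dur": (syl_end - syl_start) / utt_mean_syl_dur,
        "rel_nucleus_dur": (nucleus_end - nucleus_start) / utt_mean_syl_dur,
    }
```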
Classification of accents and boundaries

Pitch accents: classification tree (the best performance; 28 splits, 29 terminal
nodes), average accuracy of 81.67% vs. 78.6% (Rapp 1998).
Input features: amplitude (rise & fall), relative mean, max. and min. F0, Tilt,
Tilt amplitude, direction of the movement.

H*L (71.01% vs. 36.83%)
L*H (86.16% vs. 10.25%)
LH* (70.19% vs. 28.9%)
HL* (91.67% vs. 20.94%)
LH*L (89.29% vs. 3.05%)

Phrase boundaries: classification tree (the best performance; 9 splits, 10 terminal
nodes), average accuracy of 84.63%.
Input features: syllable-final F0, mean F0, direction of the movement, distance to
the next pause.

2,. (70.13% vs. 20.42%)
2,? (79.37% vs. 33.42%)
5,. (92.25% vs. 37.93%)
5,? (96.77% vs. 8.22%)

The numbers in brackets show the % of correct classifications of the model
(model performance vs. chance-level accuracy).
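As an illustration of how one of the per-task models could be trained and scored against chance level (the relative frequency of each class), the sketch below fits a classification tree for the five pitch-accent categories; scikit-learn and the synthetic data are stand-ins for the tools actually used (Quest trees, neural networks, DFA), and the feature dimensionality is arbitrary:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

ACCENT_CLASSES = ["H*L", "L*H", "LH*", "HL*", "LH*L"]

# X: one row per accented syllable (e.g. rise/fall amplitude, relative mean/max/min F0,
# Tilt, Tilt amplitude, direction of movement); y: manually assigned accent label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                 # placeholder feature vectors
y = rng.choice(ACCENT_CLASSES, size=500)      # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(max_leaf_nodes=29, random_state=0).fit(X_tr, y_tr)

# per-class accuracy of the model vs. chance level (class frequency in the data)
for cls in ACCENT_CLASSES:
    mask = y_te == cls
    if mask.any():
        acc = (tree.predict(X_te[mask]) == cls).mean()
        chance = (y_tr == cls).mean()
        print(f"{cls}: {acc:.2%} vs. chance {chance:.2%}")
```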
Example of an automatically labeled utterance (2nd tier):

[Figure: two-tier annotation of one utterance: the word tier in SAMPA
(#pSet #pujs't^s'em #do #wuSka #pijew~ #goront^se #kakao #$p #i #muvje
#t^awej #rod^z'in'e #dobranot^s #$p) and, on the 2nd tier, the automatically
assigned pitch accents (H*L, LH*, L*H, HL*) and boundary tones (2,. 2,? 5,.)]