Analysis and Recognition of Accentual Patterns
Agnieszka Wagner
Department of Phonetics, Institute of Linguistics, Adam Mickiewicz University in Poznań, Poland
wagner@amu.edu.pl
Motivation




- growing interest in speech technologies for which speech corpora annotated at the segmental and suprasegmental level are required
- suprasegmental annotation provides information on the utterance's intonation, which is significant to the output quality of speech applications (e.g., Gallwitz et al. 2002, Shriberg et al. 2005, Noeth et al. 2002)
- the use of a specific pitch accent pattern serves communication purposes
- automatic recognition of pitch accent patterns is significantly less laborious and time-consuming than manual labeling, but yields comparable accuracy and consistency (Pitrelli et al. 1994, Grice et al. 1996, Yoon et al. 2004)
The goal


- to propose a framework for the analysis of pitch accent patterns at the acoustic-phonetic level, by identifying the acoustic cues that signal prominent syllables and the different pitch accent types distinguished at the surface-phonological level
- to propose a framework for efficient and effective automatic labeling of the accentual patterns of utterances, involving detection and classification of pitch accents with accuracy similar to that of state-of-the-art approaches and comparable to the levels of agreement among human labelers in manual transcription of accentual patterns

State of the art
Features used in the analysis and labeling of accentual patterns can be acoustic, syntactic or lexical (a single type, as in Wightman & Ostendorf 1994, vs. different combinations vs. all of them, as for example in Ananthakrishnan & Narayanan 2008). The features are considered the main cues signaling prominence and are expected to discriminate well between different pitch accent categories.
On the surface-phonological level, pitch accent types are distinguished on the basis of melodic properties such as the direction, range and slope of the distinctive pitch movement, and its temporal alignment with the accented vowel. Pitch accents are described in terms of discrete bi-tonal categories: LH*, L*H, H*L, HL*, LH*L, where L marks a lower and H a higher tonal target, and the asterisk indicates which of the two tones is aligned with the accented vowel. This representation encodes both melodic and functional aspects of prosody.
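
For illustration only (this is not part of the study), the sketch below shows one way the bi-tonal inventory could be encoded in Python; the PitchAccent class and INVENTORY mapping are hypothetical names introduced here.

    # Minimal sketch: the five bi-tonal categories decomposed into tonal
    # targets plus the index of the starred (vowel-aligned) tone.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class PitchAccent:
        tones: tuple   # sequence of tonal targets, e.g. ("L", "H")
        starred: int   # index of the tone aligned with the accented vowel

        def label(self) -> str:
            return "".join(t + ("*" if i == self.starred else "")
                           for i, t in enumerate(self.tones))

    INVENTORY = {
        "H*L":  PitchAccent(("H", "L"), 0),       # peak on the vowel, then a fall
        "HL*":  PitchAccent(("H", "L"), 1),       # fall onto the vowel
        "LH*":  PitchAccent(("L", "H"), 1),       # rise onto the vowel
        "L*H":  PitchAccent(("L", "H"), 0),       # low on the vowel, then a rise
        "LH*L": PitchAccent(("L", "H", "L"), 1),  # rise-fall, peak on the vowel
    }

    assert all(acc.label() == name for name, acc in INVENTORY.items())
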
Statistical modeling techniques: neural networks (Ananthakrishnan &
Narayanan 2008), classification trees (Bulyko & Ostendorf 2001), discriminant
function analysis (Demenko 1999), maximum entropy models (Sridhar et al.
2007) or HMMs (Boidin & Boeffard 2008).
The models' complexity depends on the speaking style, the number of tasks to be performed and the number of categories to be recognized.

Performance of the existing models:
task | manual | automatic
prominence detection | 70%-87% | 80%-90%
pitch accent type classification | about 80% | 64%-95%






Features of the current approach
- prominence detection is performed at the word level, i.e., only lexically stressed syllables are taken into account
- two acoustic feature vectors constitute a representation of accentual patterns at the acoustic-phonetic level which is compact (altogether 13 features determined at the vowel or syllable level), has low redundancy and wide coverage
- from the lower-level acoustic-phonetic representation a higher-level surface-phonological description of accentual patterns can be easily and reliably derived
- two statistical models are designed: one to perform the detection task and the other to perform the classification (see the pipeline sketch below)
- speech material: a subset of sentences from the speech corpus used in the Polish module of BOSS (Bonn Open Source Synthesis, Breuer et al. 2000), labeled at the segmental and suprasegmental level (manually marked position and category of pitch accents and phrase boundaries)
- F0 extraction, preprocessing and parameterization carried out in Praat
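
The division of labor between the two models can be sketched as follows; this is a minimal illustration with hypothetical detector/classifier interfaces and field names, not the implementation used in the study.

    # Sketch of the two-stage labeling pipeline (hypothetical interfaces).
    def label_accents(syllables, detector, classifier):
        """Assign a pitch accent label (or None) to every syllable.

        syllables  : dicts with 'stressed' (bool), 'det_features' (5-dim)
                     and 'cls_features' (8-dim) entries
        detector   : binary model deciding prominent vs. non-prominent
        classifier : 5-way model choosing among H*L, L*H, LH*, HL*, LH*L
        """
        labels = []
        for syl in syllables:
            # detection is word-level: only lexically stressed syllables
            # are candidates for accentual prominence
            if not syl["stressed"] or not detector.predict(syl["det_features"]):
                labels.append(None)
            else:
                # prominent syllables receive one of the five bi-tonal categories
                labels.append(classifier.predict(syl["cls_features"]))
        return labels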

Automatic labeling involves two tasks:
- detection of accentual prominence
- classification of pitch accents (5 categories) distinguished at the surface-phonological level

Features for the analysis and labeling:
- can be easily extracted from the utterance's F0, timing cues and lexical features (vowel/syllable type, phone/syllable/word boundary, lexical stress) and exclude intensity features
- most of them refer to relative values, i.e., values normalized against variation due to prosodic structure (in the case of F0 features) or caused by intrinsic properties of phonemes, especially vowels (in the case of duration features)
- two acoustic feature vectors are defined: one for the analysis and detection of prominence, the other for the analysis and labeling of different pitch accent types (a sketch of the feature computation follows the table)

For each task the following features, classes and models are used:
task | features | classes | models
pitch accent (prominence) detection | slope, relative syllable & nucleus duration, Tilt, F0max | prominent vs. non-prominent | decision tree, NN, DFA
pitch accent type classification | amplitude (rise & fall), relative mean, max., min. F0, Tilt, Tilt amplitude, direction of the movement | H*L, L*H, LH*, HL*, LH*L | decision tree, NN, DFA
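
To make the feature definitions concrete, here is a rough numpy-only sketch of how such measures could be computed for one accented syllable. The poster does not give the exact formulas, so the choices below (a voiced, interpolated F0 contour expressed in semitones relative to the phrase mean, Tilt parameters in the spirit of Taylor's Tilt model, durations z-scored per phone class) are assumptions.

    import numpy as np

    def semitones(f0_hz, ref_hz):
        # F0 in semitones relative to a reference (e.g., the phrase mean F0)
        return 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / ref_hz)

    def f0_features(f0_syl_hz, phrase_mean_hz):
        """Relative F0 and Tilt-style parameters for one accented syllable."""
        st = semitones(f0_syl_hz, phrase_mean_hz)       # relative F0 contour
        peak = int(np.argmax(st))
        a_rise = st[peak] - st[0]                       # rise amplitude
        a_fall = st[peak] - st[-1]                      # fall amplitude
        d_rise, d_fall = peak, len(st) - 1 - peak       # rise/fall extent in frames
        denom_a = (abs(a_rise) + abs(a_fall)) or 1.0
        denom_d = (d_rise + d_fall) or 1
        return {
            "rise_amp": a_rise,
            "fall_amp": a_fall,
            "rel_mean_f0": float(st.mean()),
            "rel_max_f0": float(st.max()),
            "rel_min_f0": float(st.min()),
            "tilt": 0.5 * ((abs(a_rise) - abs(a_fall)) / denom_a
                           + (d_rise - d_fall) / denom_d),
            "tilt_amplitude": abs(a_rise) + abs(a_fall),
            "direction": float(np.sign(st[-1] - st[0])),  # rising (+1) vs. falling (-1)
        }

    def relative_duration(dur_s, phone_mean_s, phone_sd_s):
        # duration z-scored against the intrinsic mean/sd of its phone class
        return (dur_s - phone_mean_s) / phone_sd_s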

Statistical modeling techniques: neural networks (NN: RBF, MLP) trained with the back-propagation algorithm and/or the conjugate gradient descent method, classification trees (QUEST, Loh & Shih 1997) and discriminant function analysis (DFA). The models are designed semi-automatically using the Statistica 6.0 software. For each task, models of different complexity trained with different prior classification probabilities are tested, and the most efficient ones are selected for further optimization.
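
The models themselves were built in Statistica 6.0; purely as an illustration, a roughly comparable setup in Python's scikit-learn (which offers no RBF network and trains with different algorithms than conjugate-gradient back-propagation or QUEST) might look like the sketch below. The hyperparameters shown are loose analogues of the architectures reported in the result tables.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def train_models(X_train, y_train):
        """Fit one model per technique on a feature matrix X and labels y."""
        models = {
            # one hidden layer of 15 units, cf. the 8:15:5 MLP for classification
            "MLP": MLPClassifier(hidden_layer_sizes=(15,), max_iter=2000,
                                 random_state=0),
            # a binary tree with 29 leaves corresponds to 28 splits
            "decision tree": DecisionTreeClassifier(max_leaf_nodes=29,
                                                    random_state=0),
            # linear discriminant analysis as a stand-in for DFA
            "DFA": LinearDiscriminantAnalysis(),
        }
        for model in models.values():
            model.fit(X_train, y_train)
        return models

    def per_class_accuracy(y_true, y_pred):
        """Per-class accuracy and its unweighted mean (cf. the 'Average (%)' rows)."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        acc = {c: float(np.mean(y_pred[y_true == c] == c)) for c in np.unique(y_true)}
        acc["Average"] = float(np.mean(list(acc.values())))
        return acc

In the result tables that follow, the "Average (%)" rows are the unweighted means of the per-class accuracies, which is what per_class_accuracy computes.
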
Prominence detection

class | MLP (5:17:1) | RBF (5:82:1) | d. tree (34 splits) | DFA
+acc (61.18%) | 81.79 | 81.76 | 77.06 | 74.89
-acc (38.82%) | 81.65 | 82.14 | 81.2 | 80.94
Average (%) | 81.72 | 81.95 | 79.13 | 77.92

Results of pitch accent detection (test sample); numbers in brackets (column class) show chance-level accuracies.

Classification of pitch accent types

class | MLP (8:15:5) | RBF (8:82:5) | d. tree (28 splits) | DFA
H*L (36.86%) | 76.63 | 78.99 | 71.01 | 71.89
L*H (10.25%) | 77.66 | 70.21 | 86.17 | 70.21
LH* (28.9%) | 83.02 | 85.66 | 70.19 | 80.38
HL* (20.94%) | 89.58 | 89.58 | 91.67 | 88.54
LH*L (3.05%) | 60.71 | 39.29 | 89.29 | 85.71
Average (%) | 77.52 | 72.75 | 81.67 | 79.35

Results of pitch accent type classification (test sample); numbers in brackets (column class) show chance-level accuracies.

Example of an automatically labeled utterance (2nd tier): [figure: a Polish utterance in SAMPA-style transcription with automatically assigned pitch accent labels (H*L, LH*, L*H, HL*) and phrase boundary marks on the second annotation tier]
Summary
Analysis: Accentual patterns can be effectively analyzed in terms of a compact and simple acoustic-phonetic representation consisting of 13 features derived from the utterance's F0, timing cues and lexical features.
Recognition: High average accuracy (between 72.75% and 81.95%) in the detection and classification tasks; the best models yield accuracy comparable to that reported in other studies and approaching the levels of agreement among human annotators in manual labeling of accentual patterns.