Analysis and Recognition of Accentual Patterns
Agnieszka Wagner
Department of Phonetics, Institute of Linguistics, Adam Mickiewicz University in Poznań, Poland
wagner@amu.edu.pl

Motivation
- growing interest in speech technologies for which speech corpora annotated at the segmental and suprasegmental level are required
- suprasegmental annotation provides information on the utterance's intonation, which is significant to the output quality of speech applications (e.g., Gallwitz et al. 2002, Shriberg et al. 2005, Noeth et al. 2002)
- the use of a specific pitch accent pattern serves communication purposes
- automatic recognition of pitch accent patterns is significantly less laborious and time-consuming than manual labeling, yet yields comparable accuracy and consistency (Pitrelli et al. 1994, Grice et al. 1996, Yoon et al. 2004)

The goal
- to propose a framework for the analysis of pitch accent patterns at the acoustic-phonetic level by identifying acoustic cues signaling prominent syllables and the different pitch accent types distinguished at the surface-phonological level
- to propose a framework for efficient and effective automatic labeling of the accentual patterns of utterances, involving detection and classification of pitch accents with accuracy similar to that of state-of-the-art approaches and comparable to the levels of agreement among human labelers in manual transcription of accentual patterns

State of the art
Features used in the analysis and labeling of accentual patterns can be acoustic, syntactic or lexical (one type only, as in Wightman & Ostendorf 1994, vs. different combinations vs. all of them, as for example in Ananthakrishnan & Narayanan 2008). The features are considered the main cues signaling prominence and are expected to discriminate well between different pitch accent categories.
On the surface-phonological level, pitch accent types are distinguished on the basis of melodic properties such as the direction, range and slope of the distinctive pitch movement, and its temporal alignment with the accented vowel. Pitch accents are described in terms of discrete bi-tonal categories: LH*, L*H, H*L, HL*, LH*L, where L marks a lower and H a higher tonal target, and the asterisk indicates which of the two tones is aligned with the accented vowel. This representation encodes both melodic and functional aspects of prosody.
Statistical modeling techniques: neural networks (Ananthakrishnan & Narayanan 2008), classification trees (Bulyko & Ostendorf 2001), discriminant function analysis (Demenko 1999), maximum entropy models (Sridhar et al. 2007) or HMMs (Boidin & Boeffard 2008). The models' complexity depends on the speaking style, the number of tasks to be performed and the number of categories to be recognized.

Performance of the existing models:
task                                 automatic   manual
pitch accent (prominence) detection  80%-90%     70%-87%
pitch accent type classification     64%-95%     about 80%

Features of the current approach
- automatic labeling involves two tasks: detection of accentual prominence and classification of pitch accents (5 categories distinguished at the surface-phonological level)
- prominence detection is performed at the word level, i.e., only lexically stressed syllables are taken into account
- two acoustic feature vectors constitute a representation of accentual patterns at the acoustic-phonetic level which is compact (altogether 13 features determined at the vowel or syllable level), has low redundancy and wide coverage
- from the lower-level acoustic-phonetic representation, a higher-level surface-phonological description of accentual patterns can be easily and reliably derived
- two statistical models are designed: one to perform the detection task and the other the classification task
- speech material: a subset of sentences from the speech corpus used in the Polish module of BOSS (Bonn Open Source Synthesis, Breuer et al. 2000), labeled at the segmental and suprasegmental level (manually marked position and category of pitch accents and phrase boundaries)
- F0 extraction, preprocessing and parameterization carried out in Praat

Tasks, features, classes and models:
- pitch accent (prominence) detection: features: slope, relative syllable & nucleus duration, Tilt, F0max; classes: prominent vs. non-prominent; models: decision tree, NN, DFA
- pitch accent type classification: features: amplitude (rise & fall), relative mean, max. and min. F0, Tilt, Tilt amplitude, direction of the movement; classes: H*L, L*H, LH*, HL*, LH*L; models: decision tree, NN, DFA

Features for the analysis and labeling:
- can be easily extracted from the utterance's F0, timing cues and lexical features (vowel/syllable type, phone/syllable/word boundary, lexical stress), and exclude intensity features
- most of them refer to relative values, i.e., values normalized against variation due to prosodic structure (in the case of F0 features) or caused by intrinsic properties of phonemes, especially vowels (in the case of duration features)
- two acoustic feature vectors are defined: one for the analysis and detection of prominence, and the other for the analysis and labeling of the different pitch accent types (see the sketch below)
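To make the acoustic-phonetic representation more concrete, the sketch below shows one way features of this kind could be computed from the F0 samples of a single accented syllable: rise and fall amplitude, Tilt-like shape parameters, and mean/max/min F0 normalized against the utterance mean. It is only an illustration under simplifying assumptions (a clean, voiced, single rise-fall contour), written in Python rather than the Praat-based pipeline used in the study; all function and variable names are hypothetical.

```python
# Illustrative sketch only (the study used Praat for F0 extraction and
# parameterization); it assumes a voiced, single rise-fall F0 contour for the
# accented syllable. All names below are hypothetical.
import numpy as np

def tilt_style_parameters(f0, times):
    """Rise/fall amplitudes and Tilt-like shape parameters of one F0 event."""
    peak = int(np.argmax(f0))                 # index of the F0 maximum
    a_rise = f0[peak] - f0[0]                 # rise amplitude
    a_fall = f0[peak] - f0[-1]                # fall amplitude
    d_rise = times[peak] - times[0]           # rise duration (s)
    d_fall = times[-1] - times[peak]          # fall duration (s)
    eps = 1e-9                                # guard against division by zero
    tilt_amp = (abs(a_rise) - abs(a_fall)) / (abs(a_rise) + abs(a_fall) + eps)
    tilt_dur = (d_rise - d_fall) / (d_rise + d_fall + eps)
    return {
        "rise_amplitude": float(a_rise),
        "fall_amplitude": float(a_fall),
        "tilt_amplitude": float(tilt_amp),
        # overall shape parameter: close to +1 for a pure rise, -1 for a pure fall
        "tilt": float(0.5 * (tilt_amp + tilt_dur)),
    }

def relative_f0(f0, utterance_mean_f0):
    """Mean/max/min F0 normalized against the utterance mean, one simple way
    to remove variation due to prosodic structure and speaker register."""
    return {
        "rel_mean_f0": float(np.mean(f0)) / utterance_mean_f0,
        "rel_max_f0": float(np.max(f0)) / utterance_mean_f0,
        "rel_min_f0": float(np.min(f0)) / utterance_mean_f0,
    }
```

Relative syllable and nucleus durations could be obtained analogously, e.g. by normalizing raw durations against segment-specific reference values, which is one way to factor out the intrinsic duration differences between phonemes mentioned above.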
Statistical modeling techniques: neural networks (NN: RBF, MLP) trained using the back-propagation algorithm and/or the conjugate gradient descent method, classification trees (QUEST, Loh & Shih 1997) and discriminant function analysis (DFA). The models are designed semi-automatically using Statistica 6.0 software. For each task, models of different complexity trained with different prior classification probabilities are tested, and the most efficient ones are selected for further optimization.

Prominence detection
class           MLP (5:17:1)  RBF (5:82:1)  d. tree (34 splits)  DFA
+acc (61.18%)   81.79         81.76         77.06                74.89
-acc (38.82%)   81.65         82.14         81.2                 80.94
Average (%)     81.72         81.95         79.13                77.92
Results of pitch accent detection (test sample); numbers in brackets (column "class") show chance-level accuracies.

Classification of pitch accent types
class           MLP (8:15:5)  RBF (8:82:5)  d. tree (28 splits)  DFA
H*L (36.86%)    76.63         78.99         71.01                71.89
L*H (10.25%)    77.66         70.21         86.17                70.21
LH* (28.9%)     83.02         85.66         70.19                80.38
HL* (20.94%)    89.58         89.58         91.67                88.54
LH*L (3.05%)    60.71         39.29         89.29                85.71
Average (%)     77.52         72.75         81.67                79.35
Results of pitch accent type classification (test sample); numbers in brackets (column "class") show chance-level accuracies.

Example of an automatically labeled utterance (2nd tier): [figure not reproduced; a Praat annotation of a Polish utterance in SAMPA transcription, with automatically assigned pitch accent labels (H*L, LH*, L*H, HL*) and phrase boundary markers on the second tier]

Summary
Analysis: accentual patterns can be effectively analyzed in terms of a compact and simple acoustic-phonetic representation consisting of 13 features derived from the utterance's F0, timing cues and lexical features.
Recognition: high average accuracy (between 72.75% and 81.95%) in the detection and classification tasks; the best models yield accuracy comparable to that reported in other studies and approaching the levels of agreement among human annotators in manual labeling of accentual patterns.
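For readers who want to experiment with a detector of comparable shape, the following is a minimal sketch of an MLP with the same layout as the 5:17:1 network in the detection table (5 input features, 17 hidden units, one binary output). It uses scikit-learn and randomly generated placeholder data, not the Statistica 6.0 models or the BOSS corpus features of the study.

```python
# Minimal sketch: prominence detector with an MLP comparable to the 5:17:1
# network reported above. Random placeholder data stands in for the real
# 5-dimensional feature vectors (slope, relative syllable and nucleus
# duration, Tilt, F0max) and the +acc/-acc labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))         # placeholder feature vectors
y = rng.integers(0, 2, size=1000)      # placeholder prominence labels (1 = +acc)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

detector = make_pipeline(
    StandardScaler(),                                                    # scale features before the NN
    MLPClassifier(hidden_layer_sizes=(17,), max_iter=2000, random_state=0),
)
detector.fit(X_train, y_train)
print("held-out detection accuracy:", detector.score(X_test, y_test))
```

An analogous model with 8 inputs and 5 outputs would correspond to the 8:15:5 network used for pitch accent type classification.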