Automatic Labeling of Intonation Using Acoustic and Lexical Features

Agnieszka Wagner
Department of Phonetics, Institute of Linguistics, Adam Mickiewicz University in Poznań, Poland
wagner@amu.edu.pl

Motivation
Determination of accentuation and phrasing:
- growing interest in speech technologies such as speech synthesis and recognition, which require speech corpora annotated at the segmental and suprasegmental level
- suprasegmental annotation provides information on the utterance's intonation, which is significant for the output quality of speech applications (e.g., Gallwitz et al. 2002, Shriberg et al. 2005, Noeth et al. 2002)
- advantages of automatic intonation labeling: it is significantly less laborious and time-consuming than manual transcription of intonation, while yielding comparable accuracy and consistency (Pitrelli et al. 1994, Grice et al. 1996, Yoon et al. 2004)

The goal
To propose a framework for efficient and effective automatic labeling of intonation, involving detection and classification of pitch accents and phrase boundaries, with accuracy similar to that of state-of-the-art approaches and comparable to the levels of agreement among human labelers in manual transcription of intonation.

State of the art
Input features: acoustic, syntactic and lexical — a single type (as in Wightman & Ostendorf 1994), different combinations, or all of them (e.g., Ananthakrishnan & Narayanan 2008). The features can be considered the main cues signaling pitch accents and phrase boundaries and are expected to discriminate well between different categories of intonational events.
Statistical modeling techniques: neural networks (Ananthakrishnan & Narayanan 2008), classification trees (Bulyko & Ostendorf 2001), discriminant function analysis (Demenko 1999), maximum entropy models (Sridhar et al. 2007) or HMMs (Boidin & Boeffard 2008). The models' complexity depends on the speaking style, the number of tasks to be performed and the number of categories to be recognized.
Performance in the classification of accents and boundaries: the overall accuracy varies with the task. Automatic approaches report 70%–87% for pitch accent detection, 80%–90% for phrase boundary detection, 84%–94% for pitch accent type classification, 90% and higher for boundary type recognition, and about 80% for joint pitch accent and boundary tone type recognition. Agreement among human labelers in manual transcription ranges from 64%–95%, through 80%–90%, to 90% and higher, depending on the task.

Features of the current approach
- Input features can be easily extracted from the utterance's F0, timing cues and the utterance's labeling; most of them refer to relative values:
  - acoustic features: only F0 and duration; they are correlates of accentual prominence and phrase boundaries and describe the realization of different categories of pitch accents and phrase boundaries (see the parameterization sketch below)
  - lexical features: vowel & syllable type, lexical stress, syllable & word boundary, distance to the next pause
- Statistical modeling techniques: neural networks (NN: RBF, MLP), classification trees (QUEST; Loh & Shih 1997) and discriminant function analysis (DFA).
- Automatic labeling involves two detection and two classification tasks:
  - 1) detection of accentual prominence and 2) detection of phrase boundaries
  - 1) classification of pitch accents (5 categories) and 2) classification of boundary tones (4 classes) distinguished at the surface-phonological level
- Pitch accents and phrase boundaries are detected at the word level.
- Small feature vectors constitute a compact representation of intonation (from 5 to 8 features determined at the vowel or syllable level); the representation is characterized by low redundancy and wide coverage.
- Four statistical models are designed, each performing one of the detection or classification tasks.
- Speech material: a subset of sentences from the speech corpus used in the Polish module of BOSS (Bonn Open Source Synthesis, Breuer et al. 2000), labeled at the segmental and suprasegmental level (manually marked position and category of pitch accents and phrase boundaries).
- F0 extraction, preprocessing and parameterization were carried out in Praat.
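The acoustic parameterization can be illustrated with a minimal Python/NumPy sketch. It is only an approximation of the parameterization used here: the function name, the peak-based rise/fall decomposition and the normalization by corpus-mean durations are assumptions made for the example, and the Tilt-style parameters follow the general definition of the Tilt model (relative amplitude and duration of the F0 rise vs. fall), not necessarily the exact procedure implemented in Praat for this study.

import numpy as np

def syllable_features(f0, times, syl_dur, nucleus_dur, mean_syl_dur, mean_nucleus_dur):
    """Illustrative acoustic feature vector for one syllable.

    f0, times: voiced F0 samples (Hz) and their time stamps within the syllable;
    syl_dur and nucleus_dur are raw durations, normalized by corpus means to
    obtain relative durations.
    """
    f0 = np.asarray(f0, dtype=float)
    times = np.asarray(times, dtype=float)

    # Linear F0 slope (Hz/s) fitted over the syllable.
    slope = float(np.polyfit(times, f0, 1)[0]) if len(f0) > 1 else 0.0

    # Rise/fall decomposition around the F0 peak (simplified Tilt-style parameters).
    peak = int(np.argmax(f0))
    a_rise, a_fall = f0[peak] - f0[0], f0[peak] - f0[-1]
    d_rise, d_fall = times[peak] - times[0], times[-1] - times[peak]
    eps = 1e-9
    tilt_amp = (abs(a_rise) - abs(a_fall)) / (abs(a_rise) + abs(a_fall) + eps)
    tilt_dur = (d_rise - d_fall) / (d_rise + d_fall + eps)

    return {
        "slope": slope,
        "f0_max": float(np.max(f0)),
        "f0_mean": float(np.mean(f0)),
        "rel_syl_dur": syl_dur / mean_syl_dur,
        "rel_nucleus_dur": nucleus_dur / mean_nucleus_dur,
        "tilt_amp": tilt_amp,
        "tilt": 0.5 * (tilt_amp + tilt_dur),
    }

# Example call on made-up values for one syllable:
print(syllable_features([180, 210, 190], [0.00, 0.05, 0.12], 0.14, 0.08, 0.16, 0.07))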
Results
Pitch accents (accented vs. unaccented; pitch accent type)
- Detection — input features: slope, relative syllable & nucleus duration, Tilt, F0max.

  Results of pitch accent detection (test sample); numbers in brackets (column class) show chance-level accuracies.
  class            MLP    RBF    d. tree  DFA
  +acc (61.18%)    81.79  81.76  77.06    74.89
  -acc (38.82%)    81.65  82.14  81.2     80.94
  Average (%)      81.72  81.95  79.13    77.92

- Classification — input features: amplitude (rise & fall), relative mean, max. and min. F0, Tilt, Tilt amplitude, direction of the movement.
  The classification tree (28 splits, 29 terminal nodes) gives the best performance, with an average accuracy of 81.67% (cf. 78.6%, Rapp 1998). The numbers in brackets show the % of correct classifications by the model vs. the chance-level accuracy (see the evaluation sketch at the end):
  H*L (71.01% vs. 36.83%), L*H (86.16% vs. 10.25%), LH* (70.19% vs. 28.9%), HL* (91.67% vs. 20.94%), LH*L (89.29% vs. 3.05%)

Phrase boundaries (boundary vs. no boundary; boundary tone type)
- Detection — input features: slope, relative syllable & nucleus duration, Tilt, F0mean, amplitude (rise).

  Results of phrase boundary detection (test sample); numbers in brackets (column class) show chance-level accuracies.
  class            MLP    RBF    d. tree  DFA
  -b (72.59%)      81.99  79.29  84.33    90.05
  +b (27.14%)      76.26  81.55  78.6     74.04
  Average (%)      79.13  80.42  81.47    82.05

- Classification of boundary tones — input features: syllable-final F0, mean F0, direction of the movement, distance to the next pause.
  The classification tree (9 splits, 10 terminal nodes) gives the best performance, with an average accuracy of 84.63%:
  2,. (70.13% vs. 20.42%), 2,? (79.37% vs. 33.42%), 5,. (92.25% vs. 37.93%), 5,? (96.77% vs. 8.22%)

Example of an automatically labeled utterance (2nd tier):
#pSet 2,. #pujs’t^s’em #d o H*L #wuSka LH* #pijew~ 2,? H*L #goront^se LH* #kakao #$ # #muvje p i L*H #$p 5,. #t^awej H*L #rod^z’ n’e #dobranot^s HL* #$p
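In the results above, each class is compared with its chance level, i.e. the relative frequency of that class in the test sample. The sketch below shows this evaluation scheme in Python; it is purely illustrative — scikit-learn's decision tree and the randomly generated placeholder data stand in for the QUEST trees, MLP/RBF networks, DFA models and the corpus features actually used in the study.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def per_class_accuracy(y_true, y_pred):
    """Per-class accuracy together with the chance level (relative class frequency)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return {label: (float(np.mean(y_pred[y_true == label] == label)),
                    float(np.mean(y_true == label)))
            for label in np.unique(y_true)}

# Placeholder data standing in for the real feature vectors (e.g., slope,
# relative durations, Tilt, F0max) and the manually labeled accent classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.choice(["+acc", "-acc"], size=1000, p=[0.61, 0.39])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(min_samples_leaf=20)   # stand-in for the QUEST tree
clf.fit(X_train, y_train)

for label, (acc, chance) in per_class_accuracy(y_test, clf.predict(X_test)).items():
    print(f"{label}: accuracy {acc:.2%} vs. chance level {chance:.2%}")

With real feature vectors and labels in place of the placeholders, the per-class accuracies printed here correspond directly to the "model vs. chance level" figures reported in the result tables.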