An Audiovisual Feedback System for Acquiring L2 Pronunciation and L2 Prosody

Grażyna Demenko (1), Agnieszka Wagner (1), Natalia Cylwik (1) and Oliver Jokisch (2)
(1) Adam Mickiewicz University, Institute of Linguistics, Department of Phonetics, Poznań, Poland
(2) TU Dresden, Laboratory of Acoustics and Speech Communication, Dresden, Germany

The growing application of speech technology to foreign language learning, together with the recognition that proper pronunciation and prosody are as important for the intelligibility of nonnative speech as other linguistic skills, has led to the development of a new discipline known as Computer-Assisted Pronunciation Training (CAPT).

Fig. 1: Curriculum

The exercises are designed to test and practice prosody in smaller and larger syntactic units.

At the word level, suprasegmental identification is devoted mainly to:
- perception and production of regular and irregular lexical stress and foot structure, as well as types of nuclear accents, duration and intensity
- identification of mono-, di-, tri- and four-syllable words, prosodic words, enclitics, proclitics and linking

At the level of simple and complex sentences, the exercises consist of:
- production and recognition of different types of sentences, i.e. declaratives, commands, wh-questions, etc.
- identification and production of emphatic stress, relating focus to meaning
- performing communicative functions with focus (e.g. showing emotions, disagreement, calling attention to new information)
- perception and production of contrastive pitch patterns conveying various meanings, e.g. fall (finality, authority), rise (unfinished, insinuating, tentative)

Requirements on an intelligent tutoring system

In order to be effective, CAPT systems should meet the following requirements:
- allow for training of both pronunciation and prosody: weak pronunciation can sometimes preclude full intelligibility of speech, but prosody is important too, because it helps listeners to process the segmental content
- identify precisely the location and type of the error
- score the learner’s utterance so that the learner receives immediate information on the overall output quality
- provide effective feedback via different channels (visual, aural, also descriptive and contrastive feedback): the feedback should be relevant to the type of error made by the learner, easy to interpret and constructive, so that the learner understands how to self-correct and improve
- keep track of the learner’s performance, so that the features that need practice can be identified and the learner’s progress can be monitored
- be user-friendly: it should be clear how to interpret displays and evaluate results

Prosody training

Feedback (Fig. 2):
- an accurate visual representation of the student’s and the native speaker’s pitch contours in real time, paired with auditory feedback
- pitch contours are stylized using the Pitch Line software to provide a continuous representation and to ensure that only perceptually relevant pitch variations (i.e. the macroprosodic component of the pitch contour) are displayed
- relevant portions of the pitch contour (i.e. those corresponding to accented and phrase-boundary words) are described parametrically with regard to four perceptually significant features: direction, steepness and range of the distinctive pitch movement, and its temporal alignment with the onset of the accented vowel
- a higher-level surface-phonological representation of the contour is derived from the parametric description. It is expressed in terms of discrete categories of pitch accents and boundary tones, encodes melodic and functional aspects of prosody and, unlike strictly phonological representations, makes no distinction between linguistic and paralinguistic functions of prosody. Intonation contours which have different representations at the surface-phonological level convey different meanings.
- automatic assessment: qualitative results (in terms of the pitch accent and boundary tone categories) and quantitative (parametric) results are displayed on a color scale (red to green)
- the learner is instructed to compare his/her realization to that of the native speaker; quantitative measurements of both realizations are provided as support

The Euronounce project

The project Intelligent Language Tutoring System with Multimodal Feedback Functions (acronym Euronounce) aims at creating L2 pronunciation and prosody teaching software. The project focuses on Slavonic-German language pairs (Polish, Slovak, Russian and Czech, each paired with German). The Euronounce project was preceded by earlier projects carried out between 2005 and 2007. Their result was AzAR (German acronym for Automat for Accent Reduction), audiovisual software for teaching German pronunciation to Russian speakers. The AzAR architecture separates structure from content, which enables adaptation of the system to a new language or set of exercises. Building on the baseline developed in these projects, the Euronounce project adds suprasegmental exercises in addition to new language pairs.
Fig. 3: AzAR

AzAR is a knowledge-based system: it focuses on specific language pairs and uses expert knowledge of the typical errors that L2 learners make because of interference from the phonology and phonetics of their native language (L1). AzAR includes an extensive curriculum (defined by an expert) for production and perception training of difficult segmental contrasts. The learner’s task is to listen to an utterance (a minimal pair, a sentence or a fragment of a text) produced by the reference voice and to repeat it (in the production scenario), or to discriminate between the words of a minimal pair realized by the reference voice (in the perception scenario). In the production scenario the system gives multimodal (visual and audio) feedback: the learner’s utterance is displayed and scored. An oscillogram of the model utterance is presented simultaneously to allow for comparison. The learner can listen to his/her own realization of the utterance and to that produced by the reference voice. Fig. 1 illustrates the template for the production exercise in which realization of the vowel contrast (/I/ as in “bitten” vs. /i:/ as in “bieten”) can be practiced.

Features of the feedback system:
- the software applies HMM-based speech recognition and speech signal analysis to the learner’s input, which makes possible a visual and aural comparison of the user’s own performance with that of the reference voice
- automatic error detection on the phonemic level: all uttered phones are marked using a color scale (green indicates good and red indicates bad pronunciation)
- an additional visual mode includes animated visualization of the vocal tract (lip area and articulator movements) and a formant graph for particular phones
- to ensure that the information provided by the visual feedback is useful for the learner, a tutorial (Fig.2 & 3) is provided: it gives an introduction to acoustic and articulatory phonetics and explains how to interpret the acoustic displays
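The phone-level color marking described in the feature list can be illustrated with a minimal sketch. It assumes that each phone already carries a quality score between 0 (bad) and 1 (good); how AzAR actually derives and scales its scores is not stated here, so the function names `score_to_rgb` and `color_phones`, the linear red-to-green interpolation and the example scores are all hypothetical.

```python
def score_to_rgb(score):
    """Map a score in [0, 1] to an (R, G, B) color between red and green.

    Assumption: 0.0 means bad pronunciation (pure red) and 1.0 means good
    pronunciation (pure green); out-of-range scores are clamped.
    """
    s = min(1.0, max(0.0, score))
    red = int(round(255 * (1.0 - s)))
    green = int(round(255 * s))
    return (red, green, 0)


def color_phones(phones, scores):
    """Pair every uttered phone with the display color for its score."""
    if len(phones) != len(scores):
        raise ValueError("phones and scores must have the same length")
    return [(phone, score_to_rgb(score)) for phone, score in zip(phones, scores)]


if __name__ == "__main__":
    # Hypothetical scores for a learner's realization of "bitten".
    phones = ["b", "I", "t", "@", "n"]
    scores = [0.9, 0.2, 0.8, 0.6, 0.95]
    for phone, rgb in color_phones(phones, scores):
        print(phone, rgb)
```

A continuous scale is used here so that borderline realizations are shown in intermediate colors; a system with only a few discrete color steps would simply quantize the score first.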
For each exercise in the curriculum, a passage describing the classification, features and articulation of the phone is provided, together with a sagittal section of the vocal tract during production of the phone and pictures of the lip area that also show the tongue position.

[Figure: annotation tiers for the two realizations, with word labels (#co, #mam, #ją, #sukienkę, #$p) and labels such as H*L, HL*, L*H?, 5,. and 5,?]

Fig. 4: A Polish utterance realized by a native speaker (top) and by a learner with L1 German (bottom). The second tier contains the surface-phonological representation of the intonation contour.
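The parametric description that underlies the surface-phonological labels (direction, steepness and range of the pitch movement, and its alignment with the onset of the accented vowel) can be sketched as follows. This is only an illustration under stated assumptions, not the method used in the system: the movement is taken to run from the first to the last sample of an already stylized contour, range is measured in semitones, and the names `PitchMovement` and `describe_pitch_movement` as well as the 1-semitone threshold for a "level" movement are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PitchMovement:
    direction: str      # "rise", "fall" or "level"
    range_st: float     # size of the movement in semitones
    steepness: float    # semitones per second
    alignment_s: float  # movement start minus vowel onset, in seconds


def describe_pitch_movement(times_s, f0_hz, vowel_onset_s, level_threshold_st=1.0):
    """Describe one pitch movement by four parameters.

    Assumptions: times_s and f0_hz are samples of a stylized contour for a
    single movement, ordered in time, with strictly positive F0 values.
    """
    if len(times_s) != len(f0_hz) or len(times_s) < 2:
        raise ValueError("need at least two (time, F0) samples of equal length")
    duration = times_s[-1] - times_s[0]
    if duration <= 0:
        raise ValueError("samples must span a positive duration")

    # Signed interval between the start and end of the movement, in semitones.
    signed_st = 12.0 * math.log2(f0_hz[-1] / f0_hz[0])
    if abs(signed_st) < level_threshold_st:
        direction = "level"
    elif signed_st > 0:
        direction = "rise"
    else:
        direction = "fall"

    return PitchMovement(
        direction=direction,
        range_st=abs(signed_st),
        steepness=abs(signed_st) / duration,
        alignment_s=times_s[0] - vowel_onset_s,
    )


if __name__ == "__main__":
    # Hypothetical native and learner realizations of the same accent.
    native = describe_pitch_movement([0.10, 0.20, 0.30], [100.0, 150.0, 200.0], 0.15)
    learner = describe_pitch_movement([0.22, 0.30, 0.40], [180.0, 150.0, 120.0], 0.15)
    print(native)
    print(learner)
```

Comparing the two resulting descriptions field by field gives the kind of quantitative support that the learner is shown alongside the pitch contours; a negative `alignment_s` means the movement starts before the vowel onset.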