Facial expression as an input annotation modality for affective speech-to-speech translation
Éva Székely, Zeeshan Ahmed, Ingmar Steiner, Julie Carson-Berndsen
University College Dublin

Introduction
- Expressive speech synthesis in human interaction
- Speech-to-speech translation: with audiovisual input, the affective state does not need to be predicted from text
- Goal: transferring paralinguistic information from the source to the target language by means of an intermediate, symbolic representation: facial expression as an input annotation modality
- FEAST: Facial Expression-based Affective Speech Translation

System Architecture of FEAST
[Architecture diagram: the video input feeds a paralinguistic processing branch (facial analysis, emotion classification, style selection) and a linguistic processing branch (speech recognition, content analysis, translation; currently a mock-up); both branches drive the expressive synthesis that produces the output audio.]

Face detection and analysis
- SHORE library for real-time face detection and analysis
- http://www.iis.fraunhofer.de/en/bf/bsy/produkte/shore/

Emotion classification and style selection
- Aim of the facial expression analysis in the FEAST system: a single decision regarding the emotional state of the speaker over each utterance
- Visual emotion classifier, trained on segments of the SEMAINE database, with input features from SHORE
- (A schematic code sketch of these steps follows the evaluation overview below.)

Expressive speech synthesis
- Expressive unit-selection synthesis using the open-source synthesis platform MARY TTS
- German male voice dfki-pavoque-styles with four styles: Cheerful, Depressed, Aggressive, Neutral

The SEMAINE database (semaine-db.eu)
- Audiovisual database collected to study natural social signals occurring in English conversations
- Conversations with four emotionally stereotyped characters:
  - Poppy (happy, outgoing)
  - Obadiah (sad, depressive)
  - Spike (angry, confrontational)
  - Prudence (even-tempered, sensible)

Evaluation experiments
1. Does the system accurately classify emotion on the utterance level, based on the facial expression in the video input?
2. Do the synthetic voice styles succeed in conveying the target emotion category?
3. Do listeners agree with the cross-lingual transfer of paralinguistic information from the multimodal stimuli to the expressive synthetic output?
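Illustrative sketch: emotion classification and style selection
The following is a minimal sketch, in Python with scikit-learn, of how the two paralinguistic steps above could fit together: classifying the speaker's emotion once per utterance from SHORE-derived facial expression scores with an SVM, then mapping the result to one of the four PAVOQUE voice styles. Only the use of an SVM, SHORE-based features, the four SEMAINE emotion classes and the four voice styles come from the slides; the feature aggregation, the array shapes and the helper names (utterance_features, select_style, STYLE_FOR_EMOTION) are assumptions made for illustration, not the FEAST implementation.

# Illustrative sketch (not the FEAST implementation): utterance-level emotion
# classification with an SVM over aggregated facial-expression scores, followed
# by selection of a PAVOQUE voice style for expressive synthesis.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

EMOTIONS = ["happy", "sad", "angry", "neutral"]

# Emotion label -> style of the dfki-pavoque-styles voice (mapping as in the slides).
STYLE_FOR_EMOTION = {
    "happy": "cheerful",
    "sad": "depressed",
    "angry": "aggressive",
    "neutral": "neutral",
}

def utterance_features(frame_scores: np.ndarray) -> np.ndarray:
    """Aggregate per-frame expression scores (one row per video frame, e.g. SHORE's
    happy/sad/angry/surprised ratings) into a single per-utterance feature vector.
    Mean and standard deviation are an assumed aggregation, chosen for illustration."""
    return np.concatenate([frame_scores.mean(axis=0), frame_scores.std(axis=0)])

# Placeholder training data standing in for the 535 SEMAINE training utterances.
rng = np.random.default_rng(0)
X_train = rng.random((535, 8))
y_train = rng.choice(EMOTIONS, size=535)

classifier = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
classifier.fit(X_train, y_train)

def select_style(frame_scores: np.ndarray) -> str:
    """One decision per utterance: classify the emotion, then map it to a voice style."""
    features = utterance_features(frame_scores)
    emotion = classifier.predict(features[np.newaxis, :])[0]
    return STYLE_FOR_EMOTION[emotion]

# Example: 120 frames of 4 expression scores from one utterance.
print(select_style(rng.random((120, 4))))

In the full pipeline, the selected style name would then be passed, together with the translated text, to MARY TTS with the dfki-pavoque-styles voice to produce the expressive German output.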
Experiment 1: Classification of facial expressions
- Support Vector Machine (SVM) classifier trained on utterances of the male operators from the SEMAINE database
- 535 utterances used for training, 107 for testing
- Confusion matrix (English video; rows: intended emotion, columns: predicted emotion, values in %):

              happy   sad   angry   neutral
  happy         88      6      0       6
  sad           17     52     13      17
  angry          4     17     67      13
  neutral       31      8     23      38

Experiment 2: Perception of expressive synthesis
- Perception experiment with 20 subjects
- Task: listen to natural and synthesised stimuli and choose which voice style describes the utterance best: Cheerful, Depressed, Aggressive or Neutral

Experiment 2: Results
(rows: intended style, columns: perceived style, values in %)

  German synthesis      cheerful   depressed   aggressive   neutral
  cheerful                  87         0            1          12
  depressed                  1        96            0           3
  aggressive                 0         1           97           2
  neutral                    8        18            3          71

  German natural speech  cheerful   depressed   aggressive   neutral
  cheerful                   43         3            4          50
  depressed                   6        39            1          54
  aggressive                  1         0           72          27
  neutral                    12         6           12          70

Experiment 3: Adequacy for S2S translation
- Perceptual experiment with 14 bilingual participants
- 24 utterances from the SEMAINE operator data and their corresponding translation in each voice style
- Listeners were asked to choose which German translation matches the original video best.

Examples
- One video example per character, with its translation synthesised in each voice style (N = Neutral, C = Cheerful, A = Aggressive, D = Depressed): Poppy (happy), Prudence (neutral), Spike (angry), Obadiah (sad)

Experiment 3: Results
(English video / German TTS; rows: intended emotion in video, columns: selected voice style, values in %)

              cheerful   depressed   aggressive   neutral
  cheerful        80         2           14          4
  depressed       10        76            0         14
  aggressive      17         1           82          0
  neutral         56         5            6         33

Conclusion
- Preserving the paralinguistic content of a message across languages is possible with significantly greater than chance accuracy.
- The visual emotion classifier performed with an overall accuracy of 63.5%.
- Cheerful/happy is often mistaken for neutral (conditioned by the voice).

Future Work
- Extending the classifier to predict the affective state of the user based on acoustic and prosodic analysis as well as facial expressions.
- Demonstration of the prototype system taking live input through a webcam and microphone.
- Integration of a speech recogniser and a machine translation component.

Questions?