MultiModal Dialogue Systems and Other Research Projects at KTH
Rolf Carlson, CTT, KTH
www.speech.kth.se
KTH - Kungliga tekniska högskolan, Department of Speech, Music and Hearing

Kungliga tekniska högskolan, KTH
• 10,000 undergraduate students
• 1,500 graduate students
• 3,000 staff
• 800 professors and teachers

The KTH speech group - Early days
• Gunnar Fant and OVE I, 1953
• OVE II, 1958
• 1961, 1962
• Show OVE I

CTT - Centrum för talteknologi (Centre for Speech Technology)
Research Areas
• Speech Production
• Speech Perception
• Communication Aids
• Multimodal Synthesis
• Speech Understanding
• Speaker Characterization
• Spoken Language
• Interactive Dialog Systems

KTH/TTS history
• 1967, Digitally controlled OVE III
• 1974, Rule-based system RULSYS – transformation rules
• 1979, Mobile text-to-speech system – used by a non-vocal child
• 1982, Portable TTS
• 2004
  – Data-driven multimodal synthesis
  – Synthesis of emotions
  – Synthesis of breaks

Synthesis
• Original vs. formant synthesis (audio examples)
Emotions
• Neutral, Happy, Sad, Angry: natural vs. synthesis (audio examples)

Multimodal Synthesis
• Combining interior and exterior registration
• Example of resynthesis
• Show multimodal synthesis

Results for VCV-words (hearing-impaired subjects)
[Bar chart: % correct for audio alone, synthetic face and natural face combined with rule synthesis, and for the natural voice.]

The Synface telephone
• Real user tests

Multi-modal dialog systems
The Waxholm Project
• tourist information for the Stockholm archipelago: time-tables, hotels, hostels, camping and dining possibilities
• mixed-initiative dialogue
• speech recognition
• multimodal synthesis
• graphic information: pictures, maps, charts and time-tables

User answers to questions
The answers to the question "What weekday do you want to go?" (Vilken veckodag vill du åka?):
• 22% Friday (fredag)
• 11% I want to go on Friday (jag vill åka på fredag)
• 11% I want to go today (jag vill åka idag)
• 7% on Friday (på fredag)
• 6% I want to go a Friday (jag vill åka en fredag)
• – are there any hotels in Vaxholm? (finns det några hotell i Vaxholm)

Examples of questions and answers
Hur ofta åker du utomlands på semestern? (How often do you go abroad on holiday?)
• jag åker en gång om året kanske (I go maybe once a year)
• jag åker ganska sällan utomlands på semester (I rather seldom go abroad on holiday)
• jag åker nästan alltid utomlands under min semester (I almost always go abroad during my holiday)
• jag åker ungefär 2 gånger per år utomlands på semester (I go abroad on holiday about twice a year)
• jag åker utomlands nästan varje år (I go abroad almost every year)
• jag åker utomlands på semestern varje år (I go abroad on holiday every year)
• jag åker utomlands ungefär en gång om året (I go abroad about once a year)
• jag är nästan aldrig utomlands (I am almost never abroad)
• en eller två gånger om året (once or twice a year)
• en gång per semester kanske (maybe once per holiday)
• en gång per år (once a year)
• ungefär en gång per år (about once a year)
• åtminståne en gång om året (at least once a year)
• nästan aldrig (almost never)
Hur ofta reser du utomlands på semestern? (How often do you travel abroad on holiday?)
• jag reser en gång om året utomlands (I travel abroad once a year)
• jag reser inte ofta utomlands på semester, det blir mera i arbetet (I don't often travel abroad on holiday, it is more for work)
• jag reser reser utomlands på semestern vartannat år (I travel abroad on holiday every other year)
• jag reser utomlands en gång per semester (I travel abroad once per holiday)
• jag reser utomlands på semester ungefär en gång per år (I travel abroad on holiday about once a year)
• jag brukar resa utomlands på semestern åtminståne en gång i året (I usually travel abroad on holiday at least once a year)
• en gång per år (once a year)
• kanske en gång vart annat år (maybe once every other year)
• varje år (every year)
• vart tredje år ungefär (about every third year)
• nu för tiden inte så ofta (not so often these days)
• varje år brukar jag åka utomlands (every year I usually go abroad)
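The answers above can be grouped by how much of the question's wording they reuse, which is the categorization behind the result chart that follows. The sketch below is purely illustrative and hypothetical, not the actual Waxholm corpus analysis; the helper name, word lists and length threshold are my own assumptions.

```python
# Toy grouping of answers to "Vilken veckodag vill du åka?" by lexical reuse.
# Categories mirror the result chart that follows: reuse, ellipsis, other.

WEEKDAYS = {"måndag", "tisdag", "onsdag", "torsdag", "fredag", "lördag", "söndag"}

def categorize(answer):
    """Hypothetical helper: classify one answer by how much question wording it reuses."""
    words = [w.strip("?,.") for w in answer.lower().split()]
    if not words:
        return "no answer"
    if {"vill", "åka"} <= set(words):              # repeats the question's verb phrase
        return "reuse"                              # e.g. "jag vill åka på fredag"
    if set(words) & WEEKDAYS and len(words) <= 3:
        return "ellipsis"                           # e.g. "fredag", "på fredag"
    return "other"                                  # e.g. "finns det några hotell i Vaxholm"

for a in ["fredag", "jag vill åka på fredag", "finns det några hotell i Vaxholm?"]:
    print(a, "->", categorize(a))
```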
Results
• reuse 52%
• ellipsis 18%
• other 24%
• no reuse 4%
• no answer 2%

The Waxholm system
[Screenshot of an example dialogue: the user asks whether there are hotels in Waxholm, where to eat, and about the boats from Stockholm to Waxholm; the system asks which day of the week, from where and at what time the user wants to go, and shows the answers as time-tables and a map.]

Dialog systems at KTH
The August system
• Stockholm (events and general information)
• Yellow pages
• KTH and speech technology
• August Strindberg
• Greetings and social utterances
• Comments about the system capabilities and the discourse

Knowledge sources - Evaluation
• Acoustic analysis
• Syntactic analysis
• Semantic analysis
• Dialog state
• Dialog context
• Confidence
• Expectation filter

Shallow semantic analysis
• Input
  – word sequences
  – semantic features from the lexicon
• Output
  – Acceptable utterance? yes/no
  – Predicted domain (strindberg, stockholm, yellow pages, …)
  – Feature:value representation (object:restaurant, place:mariatorget)
• Trained on tagged N-best lists and the lexicon
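To make the interface above concrete, here is a minimal sketch. The toy lexicon and the simple domain voting are stand-ins for the component trained on tagged N-best lists; all names and values are assumptions for illustration, not the August implementation.

```python
LEXICON = {
    # word -> (domain vote, feature:value pairs); toy entries, invented for this sketch
    "restaurang":  ("yellow_pages", {"object": "restaurant"}),
    "mariatorget": ("yellow_pages", {"place": "mariatorget"}),
    "strindberg":  ("strindberg",   {"person": "strindberg"}),
    "stockholm":   ("stockholm",    {"place": "stockholm"}),
}

def shallow_analysis(words):
    """Return (acceptable, predicted_domain, feature_value_dict) for a word sequence."""
    votes, features = {}, {}
    for w in words:
        if w in LEXICON:
            domain, fv = LEXICON[w]
            votes[domain] = votes.get(domain, 0) + 1
            features.update(fv)
    if not votes:                       # nothing recognized: reject the hypothesis
        return False, None, {}
    domain = max(votes, key=votes.get)  # the most supported domain wins
    return True, domain, features

print(shallow_analysis("finns det någon restaurang vid mariatorget".split()))
# -> (True, 'yellow_pages', {'object': 'restaurant', 'place': 'mariatorget'})
```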
The set-up in Kulturhuset
[A sample video of the system environment: visitors greet August, ask where they are, and ask about restaurants in Kulturhuset, on Blekingegatan and close by; August answers with facts such as the roughly 700 restaurants in Stockholm and the more than 24,000 islands in the archipelago, shows the information on a map, asks the user to specify a street when needed, and adds Strindberg-flavoured remarks like "My life cannot be measured in terms of days and years!"]

The August database
• September 1998 - February 1999: 10,058 utterances (approximately 15 hours of speech) were manually checked, transcribed and analyzed
• 2,685 users: men 50%, women 26%, children 24%
• 10,058 utterances: men 55%, women 23%, children 22%

What do you say to August?
• Audio examples from a child and two women

What ... ?
• 334 utterances include "what"
  – only 75 have "what" in initial position
• 99 "what is your name"
  – all in final utterance position
  – only 13 initiate an utterance

An example of a repetitive sequence
• The utterance "Vad heter kungen?" (What is the name of the king?) as original input (top) and repeated twice by the same user

Features in repetition
[Bar chart: percentage of all repetitions for adults and children showing increased loudness, shifting of focus and more clearly articulated speech.]

The August system
[Screenshot of an example dialogue: the user asks for August's name, where and when he was born, and whether many people live in Stockholm; August answers that he calls himself Strindberg but does not really have a surname, that Strindberg was born in 1849 in Stockholm and was married three times, that over a million people live in the Stockholm area (shown on the map), refers to the Department of Speech, Music and Hearing at the Royal Institute of Technology, and closes with social remarks such as "people who live in glass houses should not throw stones".]

Dialog systems at KTH

The HIGGINS domain
Error handling in dialog systems - Initial experiments
• Studies on human-human conversation
• The Higgins domain (similar to Map Task)
• Using ASR in one direction to elicit error handling behaviour
• Set-up: the user speaks and an ASR transcribes; the operator reads the ASR output and speaks, and the user listens to the operator through a vocoder

Non-understanding error recovery
• Results show that humans tend not to signal non-understanding:
  O: Do you see a wooden house in front of you?
  U: YES CROSSING ADDRESS NOW (I pass the wooden house now)
  O: Can you see a restaurant sign?
• This leads to
  – Increased experience of task success
  – Faster recovery from non-understanding
• Skantze, G. (2003). Exploring human error handling strategies: implications for spoken dialogue systems.
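As an illustration of the strategy these results point to, the sketch below shows a dialogue policy that continues with a task-related question instead of signalling non-understanding. It is a hypothetical stand-in, not the Higgins dialogue manager; the confidence threshold and the question list are invented for the example.

```python
CONFIDENCE_THRESHOLD = 0.4   # made-up value for the sketch

TASK_QUESTIONS = [
    "Do you see a wooden house in front of you?",
    "Can you see a restaurant sign?",
    "Is there a crossing ahead of you?",
]

def next_system_turn(asr_hypothesis, confidence, question_index):
    """Return the system's next utterance and the updated question index."""
    understood = asr_hypothesis.strip() != "" and confidence >= CONFIDENCE_THRESHOLD
    nxt = min(question_index + 1, len(TASK_QUESTIONS) - 1)
    if understood:
        # Understanding: acknowledge and move on.
        return "Okay. " + TASK_QUESTIONS[nxt], nxt
    # Non-understanding: no explicit "I did not understand"; just continue the task.
    return TASK_QUESTIONS[nxt], nxt

# Mirrors the excerpt above: a low-confidence, garbled hypothesis is not flagged;
# the system simply asks the next route-related question.
print(next_system_turn("YES CROSSING ADDRESS NOW", confidence=0.2, question_index=0)[0])
```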
Prediction of Upcoming Swedish Prosodic Boundaries by Swedish and American Listeners
Rolf Carlson, Julia Hirschberg, Marc Swerts

Questions
• Are listeners able to predict the occurrence of upcoming boundaries based on acoustic and prosodic features?
• Are listeners able to distinguish different degrees of upcoming boundary strength?

Experiment
• Spontaneous utterance fragments presented to listeners
• Questions:
  – followed by a prosodic break?
  – strength of the break?
• Answers compared to labeled data

Database
• The speech corpus: an interview of a Swedish female politician (Swedish Radio)
• The interview was prosodically labeled by three independent researchers for boundary presence and strength
• A majority voting strategy was used to resolve disagreements

Stimuli
• 60 utterance fragments that all preceded the word "och" (and) in their original context (each about 2 seconds long)
• an additional 60 shortened versions consisting of only the final word of each fragment

Subjects
• 13 Swedish subjects (SW), students of logopedics from Umeå University, Sweden
• 29 American subjects (AM), staff and students at Columbia University, USA

Perceived upcoming boundary strength
[Chart: subject scores on a 5-point scale for the one-word and 2-second fragments (AM and SW). Data grouped according to expert-labeled boundary strength (no, weak or strong break), fragment size and native language.]

Word in Isolation and 2 Seconds Fragment
[Scatter plots for the American (AM) and Swedish (SW) subjects: perceived boundary strength for the 2-second fragment plotted against the word in isolation. Correlation between perceived upcoming boundary strength for each word in isolation and in a 2-second fragment; correlation coefficient r = 0.89 (SW) and r = 0.80 (AM).]

Language Difference
[Bar chart: perceived upcoming boundary strength by stimulus length (word vs. 2 seconds). Data grouped according to subjects' native language, American (AM) and Swedish (SW).]

Results
• Both native speakers of Swedish and of standard American English were indeed able to predict whether or not a boundary (of a particular strength) followed the fragment
• This suggests that prosodic rather than lexico-grammatical information was used as the primary cue

Creaky Voice
[Bar chart: number of stimuli with creaky voice (in %) for different judged boundary strength intervals (one word: 1-1.5, 1.5-2.5, 2.5-3.5, 3.5-5.0), for American and Swedish subjects.]

Perceptual results correlate with
• F0 median in the last syllable
• F0 slope in the last syllable
• Creaky voice
• Durational cues

Turn taking / Interruption

CHIL - "Computers in the Human Interaction Loop"
• Integrated Project under the European Commission's Sixth Framework Programme
• Coordinated by Universität Karlsruhe (TH) and the Fraunhofer Institute IITB
• CHIL was launched on January 1st, 2004
• http://chil.server.de/

CHIL partners
• DaimlerChrysler AG, Group Dialogue Systems, Germany
• ELDA, Evaluations and Language resources Distribution Agency, France
• IBM Ceska Republika, Czech Republic
• RESIT, Research and Education Society in Information Technologies, Greece
• INRIA (Institut National de Recherche en Informatique et en Automatique), Project GRAVIR, France
• IRST (Istituto Trentino di Cultura), Italy
• KTH (Kungl Tekniska Högskolan), Sweden
• CNRS, LIMSI (Centre National de la Recherche Scientifique through its Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur), France
• TUE (Technische Universiteit Eindhoven), The Netherlands
• Universität Karlsruhe (TH) through its Institute IPD, Germany
• UPC, Universitat Politècnica de Catalunya, Spain
• Universität Karlsruhe (TH), Interactive Systems Labs, Germany
• Fraunhofer Institut für Informations- und Datenverarbeitung (IITB), Karlsruhe, Germany
• Stanford University, USA
• CMU, Carnegie Mellon University, USA

Challenge
• The objective
  – to create environments in which computers serve humans who focus on interacting with other humans, as opposed to having to attend to and being preoccupied with the machines themselves
  – instead of computers operating in an isolated manner, with humans in the loop of computers, we will put Computers in the Human Interaction Loop (CHIL)
• Computer Services
  – models of humans and the state of their activities and intentions; based on this understanding of the human perceptual context, CHIL computers can provide helpful assistance implicitly, requiring a minimum of human attention or interruptions

CHIL - Services
• Memory Jog (MJ)
  – MJ helps the attendees by providing information related to the development of the event (meeting/lecture) and to the participants. It provides context- and content-aware information pull and push, both personalized and public.
• Attention Cockpit (AC)
  – AC monitors the attention and interest level of participants, supporting individuals who want more or less involvement in the discussion. It can also inform the Socially-Supportive Workspaces about the attentional state of the participants.
• Connector
  – Context-aware connecting services ensure that two parties are connected with each other at the right place and time and by the best media, when it is most appropriate and desirable for both parties to be connected.

Thank you
• Natural speech combined with a synthetic head controlled by the audio

The End

Perceptual Judgments of Pitch Range
Rolf Carlson, Kjell Elenius, Marc Swerts
• Question
  – To what extent are listeners able to judge where a particular utterance fragment is located in a speaker's pitch range?
• Corpus
  – 498 speakers from the Swedish SpeeCon database
  – A cumulative distribution of the F0 measurements for each speaker was calculated, based on 314 prompted utterances
[Figures: the cumulative F0 distribution (Hz) for a speaker with the 25%, 50% and 75% stimulus points marked, and F0 (semitones) across the 498 speakers.]

Experiment
• 100 stimuli, selected from a group of 50 different speakers whose F0 median distribution was representative of the distribution in the database as a whole
• The fragments were presented to subjects who were asked to estimate whether the fragment was located in the lower or the higher part of that speaker's range

Hypotheses
• H1: Listeners can estimate a speaker's range and where an utterance is positioned in this range
• H2: Listeners cannot estimate a speaker's range and make an absolute judgment of an utterance's F0, irrespective of speaker characteristics
• H3: Listeners can estimate the speaker's gender and estimate where an utterance is positioned in the gender range
[Figure: predicted percept for the 25% and 75% stimuli as a function of speaker median F0 under H1, H2 and H3.]

Judgments of pitch range
[Figures: average range judgments for the 25% and 75% stimuli, and judgments of pitch range for the 25% and 75% stimuli arranged per speaker (percept vs. speaker median F0 in semitones).]

Conclusion
[Figure: percept vs. stimulus median F0 (semitones) for the 25% and 75% stimuli.]
• Listeners' judgments depend on the gender of the speaker, but within a gender they tend to hear differences in position within the gender range.
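For illustration only, the sketch below shows one way a per-speaker F0 percentile point, such as the 25% and 75% stimulus levels used above, could be derived from F0 measurements. The function names, the 100 Hz semitone reference and the example values are assumptions, not the SpeeCon processing.

```python
import math

def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert an F0 value in Hz to semitones relative to a reference (one common choice)."""
    return 12.0 * math.log2(f0_hz / ref_hz)

def f0_percentile(f0_values_hz, fraction):
    """Semitone value below which `fraction` of the speaker's F0 measurements fall."""
    values = sorted(hz_to_semitones(v) for v in f0_values_hz if v > 0)
    index = min(int(fraction * len(values)), len(values) - 1)
    return values[index]

speaker_f0 = [110, 115, 122, 130, 142, 150, 163, 171, 180, 196]  # made-up Hz values
print(round(f0_percentile(speaker_f0, 0.25), 1), "st (25 %)")
print(round(f0_percentile(speaker_f0, 0.75), 1), "st (75 %)")
```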