Spoken Language Processing: Summing Up
Julia Hirschberg
CS 4706
7/15/2016

What We’ve Studied
• Speech phenomena
  – What can people convey by varying the way they say something?
  – How do we identify this kind of variation?
  – What tools do we have for analysis?
• Speech generation (TTS)
• Speech recognition (ASR) and understanding (ASRU)
• Applications for speech technologies

What phenomena vary in speech?
• Intonational contours (ToBI)
  – Phrasing: scope
  – Accent: focus, given/new
  – Overall contour: speech acts
• Pitch range, timing
  – Topic structure
• Voice quality, intensity, …
  – Emotion
  – Deception?
  – Charisma?

Analyzing Speech: At the Acoustic Level
• How do we capture speech data for analysis?
  – Digitizing: sampling, quantization, filtering
• How can we distinguish one speech sound from another?
  – Periodic vs. aperiodic waveforms
    • Characterizing periodic waveforms: cycle, period, phase
  – Displaying and analyzing spectra, pitch tracks
  – Comparing intensity (dB)
• Tools to do all this and more: Praat

Analyzing Speech: At the Phonetic Level
• Can we distinguish different languages in terms of their phoneme sets? Are there universal constraints on possible speech sounds?
  – Articulatory constraints
• How do we characterize the sounds of a given language?
  – Acoustic differences associated with place and manner of articulation distinguish consonants
  – Vowels differ in their formant frequencies
• Do we use such information in speech technologies?

Articulators in action
(Sample from the Queen’s University / ATR Labs X-ray Film Database)
“Why did Ken set the soggy net on top of his deck?”

Articulatory parameters for English consonants (in ARPAbet)
Manner of articulation, with place of articulation in parentheses:
• stop: p b (bilabial), t d (alveolar), k g (velar), q (glottal)
• fric.: f v (labiodental), th dh (interdental), s z (alveolar), sh zh (palatal), h (glottal)
• affric.: ch jh (palatal)
• nasal: m (bilabial), n (alveolar), ng (velar)
• approx.: w (bilabial), l/r (alveolar), y (palatal)
• flap: dx (alveolar)
Voicing: within each pair, the first symbol is voiceless and the second is voiced.

American English vowel space
[Figure: ARPAbet vowels iy, ih, ix, ux, uw, uh, eh, ax, ah, ao, ae, aa plotted on HIGH to LOW and FRONT to BACK axes]

Analyzing Speech: At the Phonological Level
• How do people develop models of intonation?
• ToBI
  – Tones: pitch accents, phrase accents, boundary tones
  – Break indices
• Hand labeling vs. automatic analysis
  – Which provides more useful information?

[Figure: ToBI phrase accent and boundary tone labels L-L%, L-H%, H-L%, H-H% against pitch accents H*, L*, L*+H]

[Figure: ToBI phrase accent and boundary tone labels L-L%, L-H%, H-L%, H-H% against pitch accents L+H*, H+!H*, H*, !H*]

Speech Generation
• Synthesis then and now
• Open problems in TTS:
  – Pronunciation modeling: OOV words, homographs, abbreviations
  – Predicting pitch accents and phrase boundaries: corpus-based approaches
  – Information status: focus, given/new
  – Modeling discourse structure
  – Producing emotional speech
  – Evaluation

Speech Recognition/Understanding
• ASR then and now: from speaker-dependent digit recognition using analog circuits to HMM-based speaker-independent recognition of spontaneous speech by computer
• Open problems
  – Segmentation: sentence, speaker, topic
  – OOV recognition
  – Handling disfluencies
  – Evaluation: transcription, semantic, task-based?
  – Recognizing emotion and other types of speaker state

Spoken Dialogue Systems
• Integrating TTS and ASR with dialogue management and task-based components
• Open questions:
  – Improving ASR accuracy
  – Recognizing dialogue acts
  – Turn-taking behavior
  – Confirmation strategies and initiative
  – Entrainment and ‘personality’
  – Evaluation

Recognizing Speaker State and Diagnosis
• Emotional speech
  – Voice quality
• Deceptive speech
• Charismatic speech
• Customer care rep evaluation
• Medical diagnosis
  – Paranoia and other psychiatric disorders
  – Cancer patient prognosis

Take-Home Final
• Due: May 14 by 4:10 pm
• Submission instructions:
  – This examination is designed to test your ability to synthesize information and to perform critical analysis of published research. Choose 3 of the following 4 questions to answer. Each question should be answered with specific reference to the readings specified, all of which are linked to the syllabus for the class on the date given. (I.e., cite articles with page numbers to support claims about authors’ findings or claims, as “McLeod et al. (1998) claim that existing Spoken Dialogue Systems’ major drawback is their lack of delightful personalities (p. 4).”) Do not attempt to answer the questions until you have read and understood the specified articles. Essays that do not show evidence of this understanding will not receive high marks.
  – Each essay will be worth 33 1/3 points. Each essay should be no more than 1200 words in length; only the first 1200 words of each essay will be graded, so please do not exceed this limit. If you can answer the question in a shorter essay, feel free to do so. Please use plain ASCII or Word and report word counts for each essay.

Sample Question
Agree or disagree: “It is more difficult to recognize deception automatically from acoustic/prosodic and lexical cues than from visual cues obtained from face or body gesture.” Use the readings assigned for April 28 to support your answer.
1. Show that you understand the question and are answering it
  • E.g., “I believe that it is more difficult to recognize deception automatically from visual cues than from acoustic/prosodic and lexical cues.”
2. For agree/disagree questions, decide whether you basically agree or disagree
  • E.g., “While there are difficulties recognizing deception from both types of cues, I believe it is more difficult to recognize deception from visual cues than from language-based cues.”
3. Provide evidence on both sides of the question
  • “While both audio and visual cues require high-quality recordings, audio recordings must be obtained in a quiet environment whereas video recordings can be obtained in a wider variety of situations, provided that equipment is available.”
  • “While Mehrabian (1971) found significant effects for both visual and language-based cues, the particular language cues he identified in this study would seem to be easier to recognize automatically than the visual cues: For example, it should be easier to identify amount of speech and speaking rate than features such as ‘rocking gestures’ and ‘leg and foot movements’.”
4. Support your statements with specific reference to your sources
  • E.g., “DePaulo et al. (1983) find that…”
  • Or, “Motivation greatly influences subjects’ ability to control their verbal cues (DePaulo et al., 1983).”
• When in doubt, cite