Prosody in Generation
JH 7/15/2016

Natural Language Generation (NLG)
• A typical NLG system performs:
  - Text planning: transforms a communicative goal into a sequence or structure of elementary goals
  - Sentence planning: chooses linguistic resources to achieve those goals
  - Realization: produces the surface output

Research Directions in NLG
• Past focus
  - Hand-crafted rules inspired by small corpora
  - Very little evaluation
  - Monologue text generation
• New directions
  - Large-scale corpus-based learning of system components
  - Evaluation is important, but how to do it is still unclear
  - Spoken monologue and dialogue

AT&T Labs Research
How to produce speech instead of text?

Overview
• Spoken NLG in Dialogue Systems
• Text-to-Speech (TTS) vs. Concept-to-Speech (CTS)
• Current Approaches to CTS
  - Hand-built systems
  - Corpus-based systems
• NLG Evaluation
• Open Questions

Importance of NLG in Dialogue Systems
• Conveying information intonationally, for conciseness and naturalness
  - System turns in dialogue systems can be shorter:
    S: Did you say you want to go to Boston?
    S: (You want to go to) Boston H-H%
• Not providing misinformation through misleading prosody:
    S: (You want to go to) Boston L-L%
• Silverman et al. '93: mimicking human prosody improves transcription accuracy in a reverse telephone directory task
• Sanderman & Collier '97: subjects were quicker to respond to 'appropriately phrased' ambiguous responses to questions in a monitoring task
    Q: How did I reserve a room? vs. Which facility did the hotel have?
    A: I reserved a room L-H% in the hotel with the fax.
    A: I reserved a room in the hotel L-H% with the fax.

Prosodic Generation for TTS
• Default prosodic assignment from simple text analysis
• Hand-built rule-based systems: hard to modify and adapt to new domains
• Corpus-based approaches (Sproat et al. '92)
  - Train prosodic variation on large labeled corpora using machine learning techniques
  - Accent and phrasing decisions
  - Associate prosodic labels with simple features of transcripts:
    • number of words in the phrase
    • distance from the beginning or end of the phrase
    • orthography: punctuation, paragraphing
    • part of speech, constituent information
  - Apply the learned rules to new text
• Incremental improvements continue:
  - Adding higher-accuracy parsing (Koehn et al. '00), e.g. the Collins '99 parser
  - More sophisticated learning algorithms (Schapire & Singer '00)
  - Better representations: tree-based?
• But rules are always impoverished
• And how do we define the Gold Standard?

Spoken NLG
• Decisions in Text-to-Speech (TTS) depend on syntax, information status, topic structure, ... — information explicitly available to NLG
• Concept-to-Speech (CTS) systems should be able to specify "better" prosody: the system knows what it wants to say and can specify how
• But ... generating prosody for CTS isn't so easy

Relying upon Prior Research
• MIMIC CTS (Nakatani & Chu-Carroll '00)
  - Uses the domain attribute/value distinction to drive phrasing and accent, so critical information is focused:
    Movie: October Sky
    Theatre: Hoboken Theatre
    Town: Hoboken
  - Attribute names and values are always accented
  - Values are set off by phrase boundaries
• Information status is conveyed by varying accent type (Pierrehumbert & Hirschberg '90):
  - Old (given): L*
  - Inferrable (by MIMIC, e.g.
    a theatre name from the town): L*+H
  - Key (to formulating a valid query): L+H*
  - New: H*

Marking Dialogue Acts
• NotifyFailure:
    U: Where is "The Corrupter" playing in Cranford.
    S: "The Corrupter" [L+H*] is not [L+H*] playing in Cranford [L*+H].
• Other rules for logical connectives, clarification and confirmation subdialogues
• Contrastive accent for semantic parallelism (Rooth '92, Pulman '97), used in GoalGetter and OVIS (Theune '99):
    The cat eats fish. The dog eats meat.

But ... many counterexamples
• The association of prosody with many syntactic, semantic, and pragmatic concepts is still an open question
• Prosody generation rests on (past) observed regularities and assumptions:
  - Information can be 'chunked' usefully by phrasing for easier user understanding — but in many different ways
  - Information status can be conveyed by accent:
    • Contrastive information is accented?
      S: You want to go to L+H* Nijmegen, L+H* not Eindhoven.
    • Given information is deaccented? Speaker vs. hearer givenness:
      U: I want to go to Nijmegen.
      S: You want to go to H* Nijmegen?
  - Intonational contours can convey speech acts and speaker beliefs:
    • Continuation rise can maintain the floor?
      S: I am going to get you the train information [L-H%].
    • Backchanneling can be produced appropriately?
      S: Okay. Okay? Okaaay... Mhmm...
    • Wh- and yes-no questions can be signaled appropriately?
      S: Where do you want to go.
      S: What is your passport number?
  - Discourse/topic structure can be signaled by varying pitch range, pausal duration, and rate?

MAGIC
• Multimedia system for presenting cardiac patient data
  - Developed at Columbia by McKeown and colleagues, in conjunction with Columbia Presbyterian Medical Center, to automate post-operative status reporting for bypass patients
  - Uses mostly traditional hand-developed NLG components
  - Generates text, then annotates it prosodically
  - Corpus-trained prosodic assignment component
• Corpus: written and oral patient reports
  - 50 min multi-speaker spontaneous speech + 11 min single-speaker read speech
  - 1.24M-word text corpus of discharge summaries
  - Transcribed and ToBI-labeled
• Generator features labeled/extracted:
  - syntactic function
  - part of speech
  - semantic category
  - semantic 'informativeness' (rarity in corpus)
  - semantic constituent boundary location and length
  - salience
  - given/new
  - focus
  - theme/rheme
  - 'importance'
  - 'unexpectedness'
• The features proved very hard to label
• Results: new features to specify TTS prosody
  - Of the CTS-specific features, only semantic informativeness (likelihood of occurring in a corpus) has been useful so far (Pan & McKeown '99)
  - Looking at context and word collocation for accent placement helps predict accent (Pan & Hirschberg '00):
    RED CELL (less predictable) vs. BLOOD cell (more predictable)
  - The most predictable words are accented less frequently (40-46%) and the least predictable more frequently (73-80%)
  - A unigram+bigram model predicts accent status with 77% (+/- .51) accuracy

Stochastic, Corpus-based NLG
• Generate from a corpus rather than a hand-built system
  - For an MT task, Langkilde & Knight '98 overgenerate from a traditional hand-built grammar
  - The output is composed into a lattice
  - A linear (bigram) language model chooses the best path
• But ... there is no guarantee of grammaticality
  - How to evaluate/improve results?
  - How to incorporate prosody into this kind of generation model?
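The overgenerate-and-select idea just described (candidate realizations composed into a lattice, with a linear language model choosing the best path) can be sketched as follows. This is a minimal illustration, not the Langkilde & Knight system: the bigram counts are invented toy numbers, and brute-force enumeration of word orders stands in for a real lattice.

```python
import math
from itertools import permutations

# Toy bigram counts, invented for illustration only.
BIGRAM_COUNTS = {
    ("<s>", "poachers"): 2, ("<s>", "now"): 5,
    ("now", "poachers"): 4, ("poachers", "now"): 1,
    ("poachers", "control"): 6, ("control", "the"): 7,
    ("the", "underground"): 5, ("underground", "trade"): 5,
    ("trade", "</s>"): 6, ("trade", "the"): 1,
    ("the", "trade"): 1, ("underground", "</s>"): 1,
    ("control", "underground"): 1, ("now", "</s>"): 1,
}
VOCAB_SIZE = 8  # 6 content words plus <s> and </s>, for add-one smoothing

def bigram_logprob(prev, word):
    """Add-one-smoothed bigram log probability P(word | prev)."""
    num = BIGRAM_COUNTS.get((prev, word), 0) + 1
    den = sum(c for (p, _), c in BIGRAM_COUNTS.items() if p == prev) + VOCAB_SIZE
    return math.log(num / den)

def score(sentence):
    """Total bigram log probability of a word sequence, with boundary markers."""
    words = ["<s>"] + sentence + ["</s>"]
    return sum(bigram_logprob(p, w) for p, w in zip(words, words[1:]))

def choose_best(candidates):
    """Linear-precedence choice: pick the most probable linearization."""
    return max(candidates, key=score)

# Overgeneration stand-in: all orderings of the content words from the example.
words = ["now", "poachers", "control", "the", "underground", "trade"]
candidates = [list(p) for p in permutations(words)]
best = choose_best(candidates)
print(" ".join(best))  # → now poachers control the underground trade
```

A real system would restrict the candidate set to grammatical linearizations (FERGUS uses the XTAG grammar for this) and would score prosodic labels along with the words; the selection mechanism, however, is exactly this argmax over path probabilities.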
FERGUS (Bangalore & Rambow '00)
• Corpus-based learning to refine syntactic, lexical and prosodic choice
• Domain is the DARPA Communicator task (air travel information)
• Uses a stochastic tree model + a linear LM + the XTAG (hand-crafted) grammar
• Trained on WSJ dependency trees tagged with part of speech, morphological information, syntactic SuperTags (grammatical function, subcategorization frame, argument realization), WordNet sense tags, and prosodic labels (accent and boundary)
• Input: a dependency tree of lexemes; any feature can be specified, e.g. syntactic or prosodic control
    [dependency tree: control → {poachers <L+H*>, now, trade → {the, underground}}]
• Tree Chooser: selects syntactic/prosodic properties for input nodes based on a match with the features of mothers and daughters in the corpus
• Unraveler: produces a lattice of all syntactically possible linearizations of the tree, using the XTAG grammar
    [word lattice over: poachers, now, control, the, underground, trade]
• Linear Precedence Chooser: finds the most likely lattice traversal, using a trigram language model
    Now [H*] poachers [L+H*] [L-] control the underground trade [H*] [L-L%].
• There are many ways to implement each step:
  - How to choose which works 'best'?
  - How to evaluate the output?

Evaluating NLG
• How to judge success/progress in NLG is an open question
  - Qualitative measures: preference
  - Quantitative measures:
    • task performance measures: speed, accuracy
    • automatic comparison to a reference corpus (e.g. string edit distance and variants, tree-similarity-based metrics)
  - There is not always a single "best" solution
• It is critical for stochastic systems to combine qualitative judgments with quantitative measures (Walker et al. '97)

Qualitative Validation of Quantitative Metrics
• Subjects judged the understandability and quality of candidates proposed by 4 evaluation metrics that minimize distance from a Gold Standard (Bangalore, Rambow & Whittaker '00)
• Tree-based metrics correlate significantly with understandability and quality judgments; string metrics do not
• New objective metrics learned:
  - Understandability accuracy = (1.31 * simple tree accuracy - .10 * substitutions - .44) / .87
  - Quality accuracy = (1.02 * simple tree accuracy - .08 * substitutions - .35) / .67

More Open Questions for Spoken NLG
• How much to model the human original?
• Planning for appropriate intonational variation is important even in recorded prompts
• Timing and backchanneling
• What kind of output is most comprehensible?
• What kind of output elicits the most easily understood user response? (Gustafson et al. '97, Clark & Brennan '99)
• Implementing variations in dialogue strategy:
  - Implicit confirmation
  - Mixed initiative
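The string edit-distance comparison against a reference corpus, mentioned above as one of the quantitative NLG evaluation metrics, can be sketched as a standard word-level Levenshtein distance. The sentences below are invented examples for illustration.

```python
def word_edit_distance(candidate, reference):
    """Word-level Levenshtein distance: the minimum number of
    insertions, deletions, and substitutions needed to turn the
    candidate word sequence into the reference."""
    c, r = candidate.split(), reference.split()
    # dp[i][j] = distance between the first i candidate words
    # and the first j reference words
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i in range(len(c) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(c) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if c[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[-1][-1]

gold = "now poachers control the underground trade"
generated = "poachers now control the underground trade"
print(word_edit_distance(generated, gold))  # → 2 (two substitutions)
```

As the validation study above found, such string metrics are cheap to compute but need not correlate with human judgments of understandability or quality, which is one motivation for the tree-similarity-based alternatives.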