Prosody in Generation
JH 7/15/2016

Natural Language Generation (NLG)
• A typical NLG system performs:
  - Text planning: transforms a communicative goal into a sequence or structure of elementary goals
  - Sentence planning: chooses linguistic resources to achieve those goals
  - Realization: produces the surface output

Research Directions in NLG
• Past focus
  - Hand-crafted rules inspired by small corpora
  - Very little evaluation
  - Monologue text generation
• New directions
  - Large-scale corpus-based learning of system components
  - Evaluation is important, but how to do it is still unclear
  - Spoken monologue and dialogue

AT&T Labs Research
How to produce speech instead of text?

Overview
• Spoken NLG in Dialogue Systems
• Text-to-Speech (TTS) vs. Concept-to-Speech (CTS)
• Current Approaches to CTS
  - Hand-built systems
  - Corpus-based systems
• NLG Evaluation
• Open Questions

Importance of NLG in Dialogue Systems
• Conveying information intonationally, for conciseness and naturalness
  - System turns in dialogue systems can be shorter:
    S: Did you say you want to go to Boston?
    S: (You want to go to) Boston H-H%
• Not providing misinformation through misleading prosody:
    S: (You want to go to) Boston L-L%
• Silverman et al. '93: mimicking human prosody improves transcription accuracy in a reverse telephone directory task
• Sanderman & Collier '97: subjects were quicker to respond to 'appropriately phrased' ambiguous responses to questions in a monitoring task
    Q: How did I reserve a room? vs. Which facility did the hotel have?
    A: I reserved a room L-H% in the hotel with the fax.
    A: I reserved a room in the hotel L-H% with the fax.

Prosodic Generation for TTS
• Default prosodic assignment from simple text analysis
• Hand-built rule-based systems: hard to modify and adapt to new domains
• Corpus-based approaches (Sproat et al. '92)
  - Train prosodic variation on large labeled corpora using machine learning techniques
  - Accent and phrasing decisions
  - Associate prosodic labels with simple features of transcripts:
    • number of words in the phrase
    • distance from the beginning or end of the phrase
    • orthography: punctuation, paragraphing
    • part of speech, constituent information
  - Apply the learned rules to new text
• Incremental improvements continue:
  - Adding higher-accuracy parsing (Koehn et al. '00), e.g. the Collins '99 parser
  - More sophisticated learning algorithms (Schapire & Singer '00)
  - Better representations: tree-based?
• But rules are always impoverished
• And how do we define the Gold Standard?

Spoken NLG
• Decisions in Text-to-Speech (TTS) depend on syntax, information status, topic structure, ... — information explicitly available to NLG
• Concept-to-Speech (CTS) systems should be able to specify "better" prosody: the system knows what it wants to say and can specify how
• But ... generating prosody for CTS isn't so easy

Relying upon Prior Research
• MIMIC CTS (Nakatani & Chu-Carroll '00)
  - Uses the domain attribute/value distinction to drive phrasing and accent, so critical information is focused:
    Movie: October Sky
    Theatre: Hoboken Theatre
    Town: Hoboken
  - Attribute names and values are always accented
  - Values are set off by phrase boundaries
• Information status is conveyed by varying accent type (Pierrehumbert & Hirschberg '90):
  - Old (given): L*
  - Inferrable (by MIMIC, e.g.
    a theatre name from the town): L*+H
  - Key (to formulating a valid query): L+H*
  - New: H*

Marking Dialogue Acts
• NotifyFailure:
    U: Where is "The Corrupter" playing in Cranford.
    S: "The Corrupter" [L+H*] is not [L+H*] playing in Cranford [L*+H].
• Other rules for logical connectives, clarification and confirmation subdialogues
• Contrastive accent for semantic parallelism (Rooth '92, Pulman '97), used in GoalGetter and OVIS (Theune '99):
    The cat eats fish. The dog eats meat.

But ... many counterexamples
• The association of prosody with many syntactic, semantic, and pragmatic concepts is still an open question
• Prosody generation rests on (past) observed regularities and assumptions:
  - Information can be 'chunked' usefully by phrasing for easier user understanding — but in many different ways
  - Information status can be conveyed by accent:
    • Contrastive information is accented?
      S: You want to go to L+H* Nijmegen, L+H* not Eindhoven.
    • Given information is deaccented? Speaker vs. hearer givenness:
      U: I want to go to Nijmegen.
      S: You want to go to H* Nijmegen?
  - Intonational contours can convey speech acts and speaker beliefs:
    • Continuation rise can maintain the floor?
      S: I am going to get you the train information [L-H%].
    • Backchanneling can be produced appropriately?
      S: Okay. Okay? Okaaay... Mhmm...
    • Wh- and yes-no questions can be signaled appropriately?
      S: Where do you want to go.
      S: What is your passport number?
  - Discourse/topic structure can be signaled by varying pitch range, pausal duration, and rate?

MAGIC
• Multimedia system for presenting cardiac patient data
  - Developed at Columbia by McKeown and colleagues, in conjunction with Columbia Presbyterian Medical Center, to automate post-operative status reporting for bypass patients
  - Uses mostly traditional hand-developed NLG components
  - Generates text, then annotates it prosodically
  - Corpus-trained prosodic assignment component
• Corpus: written and oral patient reports
  - 50 min multi-speaker spontaneous speech + 11 min single-speaker read speech
  - 1.24M-word text corpus of discharge summaries
  - Transcribed and ToBI-labeled
• Generator features labeled/extracted:
  - syntactic function
  - part of speech
  - semantic category
  - semantic 'informativeness' (rarity in corpus)
  - semantic constituent boundary location and length
  - salience
  - given/new
  - focus
  - theme/rheme
  - 'importance'
  - 'unexpectedness'
• The features proved very hard to label
• Results: new features to specify TTS prosody
  - Of the CTS-specific features, only semantic informativeness (likelihood of occurring in a corpus) has been useful so far (Pan & McKeown '99)
  - Looking at context and word collocation for accent placement helps predict accent (Pan & Hirschberg '00):
    RED CELL (less predictable) vs. BLOOD cell (more predictable)
  - The most predictable words are accented less frequently (40-46%) and the least predictable more frequently (73-80%)
  - A unigram+bigram model predicts accent status with 77% (+/- .51) accuracy

Stochastic, Corpus-based NLG
• Generate from a corpus rather than a hand-built system
  - For an MT task, Langkilde & Knight '98 overgenerate from a traditional hand-built grammar
  - The output is composed into a lattice
  - A linear (bigram) language model chooses the best path
• But ... there is no guarantee of grammaticality
  - How to evaluate/improve results?
  - How to incorporate prosody into this kind of generation model?
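The overgenerate-and-select idea just described (candidate realizations composed into a lattice, with a linear language model choosing the best path) can be sketched as follows. This is a minimal illustration, not the Langkilde & Knight system: the bigram counts are invented toy numbers, and brute-force enumeration of word orders stands in for a real lattice.

```python
import math
from itertools import permutations

# Toy bigram counts, invented for illustration only.
BIGRAM_COUNTS = {
    ("<s>", "poachers"): 2, ("<s>", "now"): 5,
    ("now", "poachers"): 4, ("poachers", "now"): 1,
    ("poachers", "control"): 6, ("control", "the"): 7,
    ("the", "underground"): 5, ("underground", "trade"): 5,
    ("trade", "</s>"): 6, ("trade", "the"): 1,
    ("the", "trade"): 1, ("underground", "</s>"): 1,
    ("control", "underground"): 1, ("now", "</s>"): 1,
}
VOCAB_SIZE = 8  # 6 content words plus <s> and </s>, for add-one smoothing

def bigram_logprob(prev, word):
    """Add-one-smoothed bigram log probability P(word | prev)."""
    num = BIGRAM_COUNTS.get((prev, word), 0) + 1
    den = sum(c for (p, _), c in BIGRAM_COUNTS.items() if p == prev) + VOCAB_SIZE
    return math.log(num / den)

def score(sentence):
    """Total bigram log probability of a word sequence, with boundary markers."""
    words = ["<s>"] + sentence + ["</s>"]
    return sum(bigram_logprob(p, w) for p, w in zip(words, words[1:]))

def choose_best(candidates):
    """Linear-precedence choice: pick the most probable linearization."""
    return max(candidates, key=score)

# Overgeneration stand-in: all orderings of the content words from the example.
words = ["now", "poachers", "control", "the", "underground", "trade"]
candidates = [list(p) for p in permutations(words)]
best = choose_best(candidates)
print(" ".join(best))  # → now poachers control the underground trade
```

A real system would restrict the candidate set to grammatical linearizations (FERGUS uses the XTAG grammar for this) and would score prosodic labels along with the words; the selection mechanism, however, is exactly this argmax over path probabilities.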
FERGUS (Bangalore & Rambow '00)
• Corpus-based learning to refine syntactic, lexical and prosodic choice
• Domain is the DARPA Communicator task (air travel information)
• Uses a stochastic tree model + a linear LM + the XTAG (hand-crafted) grammar
• Trained on WSJ dependency trees tagged with part of speech, morphological information, syntactic SuperTags (grammatical function, subcategorization frame, argument realization), WordNet sense tags, and prosodic labels (accent and boundary)
• Input: a dependency tree of lexemes; any feature can be specified, e.g. syntactic or prosodic control
    [dependency tree: control → {poachers <L+H*>, now, trade → {the, underground}}]
• Tree Chooser: selects syntactic/prosodic properties for input nodes based on a match with the features of mothers and daughters in the corpus
• Unraveler: produces a lattice of all syntactically possible linearizations of the tree, using the XTAG grammar
    [word lattice over: poachers, now, control, the, underground, trade]
• Linear Precedence Chooser: finds the most likely lattice traversal, using a trigram language model
    Now [H*] poachers [L+H*] [L-] control the underground trade [H*] [L-L%].
• There are many ways to implement each step:
  - How to choose which works 'best'?
  - How to evaluate the output?

Evaluating NLG
• How to judge success/progress in NLG is an open question
  - Qualitative measures: preference
  - Quantitative measures:
    • task performance measures: speed, accuracy
    • automatic comparison to a reference corpus (e.g. string edit distance and variants, tree-similarity-based metrics)
  - There is not always a single "best" solution
• It is critical for stochastic systems to combine qualitative judgments with quantitative measures (Walker et al. '97)

Qualitative Validation of Quantitative Metrics
• Subjects judged the understandability and quality of candidates proposed by 4 evaluation metrics that minimize distance from a Gold Standard (Bangalore, Rambow & Whittaker '00)
• Tree-based metrics correlate significantly with understandability and quality judgments; string metrics do not
• New objective metrics learned:
  - Understandability accuracy = (1.31 * simple tree accuracy - .10 * substitutions - .44) / .87
  - Quality accuracy = (1.02 * simple tree accuracy - .08 * substitutions - .35) / .67

More Open Questions for Spoken NLG
• How much to model the human original?
• Planning for appropriate intonational variation is important even in recorded prompts
• Timing and backchanneling
• What kind of output is most comprehensible?
• What kind of output elicits the most easily understood user response? (Gustafson et al. '97, Clark & Brennan '99)
• Implementing variations in dialogue strategy:
  - Implicit confirmation
  - Mixed initiative
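The string edit-distance comparison against a reference corpus, mentioned above as one of the quantitative NLG evaluation metrics, can be sketched as a standard word-level Levenshtein distance. The sentences below are invented examples for illustration.

```python
def word_edit_distance(candidate, reference):
    """Word-level Levenshtein distance: the minimum number of
    insertions, deletions, and substitutions needed to turn the
    candidate word sequence into the reference."""
    c, r = candidate.split(), reference.split()
    # dp[i][j] = distance between the first i candidate words
    # and the first j reference words
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i in range(len(c) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(c) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if c[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[-1][-1]

gold = "now poachers control the underground trade"
generated = "poachers now control the underground trade"
print(word_edit_distance(generated, gold))  # → 2 (two substitutions)
```

As the validation study above found, such string metrics are cheap to compute but need not correlate with human judgments of understandability or quality, which is one motivation for the tree-similarity-based alternatives.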