Prosody in Generation
JH 7/15/2016
Natural Language Generation (NLG)
• A typical NLG system:
 – Text planning: transforms a communicative goal into a sequence or structure of elementary goals
 – Sentence planning: chooses linguistic resources to achieve those goals
 – Realization: produces the surface output
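The three stages above can be sketched as a minimal pipeline. This is an illustrative toy only; all function names, goal labels, and the dialogue act inventory are invented, not taken from any real NLG system.

```python
# Hypothetical sketch of the three NLG stages (all names invented).

def text_planner(goal):
    """Text planning: communicative goal -> sequence of elementary goals."""
    if goal == "confirm_destination":
        return [("assert", "destination"), ("request", "confirmation")]
    return []

def sentence_planner(elementary_goals, context):
    """Sentence planning: choose linguistic resources for each goal."""
    return [{"act": act, "topic": topic, "value": context.get(topic, "")}
            for act, topic in elementary_goals]

def realizer(plans):
    """Realization: produce the surface output string."""
    parts = []
    for p in plans:
        if p["act"] == "assert":
            parts.append(f"You want to go to {p['value']}")
        elif p["act"] == "request":
            parts.append("is that right?")
    return ", ".join(parts)

print(realizer(sentence_planner(text_planner("confirm_destination"),
                                {"destination": "Boston"})))
# prints: You want to go to Boston, is that right?
```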
Research Directions in NLG
• Past focus
 – Hand-crafted rules inspired by small corpora
 – Very little evaluation
 – Monologue text generation
• New directions
 – Large-scale corpus-based learning of system components
 – Evaluation important, but how to do it still unclear
 – Spoken monologue and dialogue
AT&T Labs Research
How to produce speech instead of text?
Overview
• Spoken NLG in Dialogue Systems
• Text-to-Speech (TTS) vs. Concept-to-Speech (CTS)
• Current Approaches to CTS
 – Hand-built systems
 – Corpus-based systems
• NLG Evaluation
• Open Questions
Importance of NLG in Dialogue Systems
• Conveying information intonationally for conciseness and naturalness
 – System turns in dialogue systems can be shorter
   S: Did you say you want to go to Boston?
   S: (You want to go to) Boston H-H%
• Not providing misinformation through misleading prosody
   S: (You want to go to) Boston L-L%
• Silverman et al. ’93: mimicking human prosody improves transcription accuracy in a reverse telephone directory task
• Sanderman & Collier ’97: subjects were quicker to respond to ‘appropriately phrased’ ambiguous responses to questions in a monitoring task
   Q: How did I reserve a room? vs. Which facility did the hotel have?
   A: I reserved a room L-H% in the hotel with the fax.
   A: I reserved a room in the hotel L-H% with the fax.
Overview
• Spoken NLG in Dialogue Systems
• Text-to-Speech (TTS) vs. Concept-to-Speech (CTS)
• Current Approaches to CTS
 – Hand-built systems
 – Corpus-based systems
• NLG Evaluation
• Open Questions
Prosodic Generation for TTS
• Default prosodic assignment from simple text analysis
• Hand-built rule-based systems: hard to modify and adapt to new domains
• Corpus-based approaches (Sproat et al. ’92)
 – Train prosodic variation on large labeled corpora using machine learning techniques
 – Accent and phrasing decisions
 – Associate prosodic labels with simple features of transcripts
   • # of words in phrase
   • distance from beginning or end of phrase
   • orthography: punctuation, paragraphing
   • part of speech, constituent information
 – Apply learned rules to new text
• Incremental improvements continue:
 – Adding higher-accuracy parsing (Koehn et al. ’00)
   • Collins ’99 parser
 – More sophisticated learning algorithms (Schapire & Singer ’00)
 – Better representations: tree-based?
• Rules always impoverished
• How to define a Gold Standard?
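A corpus-based phrasing predictor of this kind can be sketched with a few of the simple features listed above. This is a toy illustration of the general approach, not Sproat et al.’s actual learner; the feature encoding and the majority-vote "training" are invented for clarity.

```python
# Toy sketch of corpus-based phrase-boundary prediction from simple
# text features (illustrative only, not any published system's code).
from collections import defaultdict

def features(tokens, i):
    """Simple features for the juncture after token i."""
    tok = tokens[i]
    return (
        tok["pos"],                   # part of speech
        tok["punct_after"],           # orthography: following punctuation
        min(i, 3),                    # distance from phrase start (capped)
        min(len(tokens) - 1 - i, 3),  # distance from phrase end (capped)
    )

def train(corpus):
    """Count boundary vs. no-boundary outcomes per feature tuple."""
    counts = defaultdict(lambda: [0, 0])  # feature -> [no-boundary, boundary]
    for tokens, boundaries in corpus:
        for i in range(len(tokens)):
            counts[features(tokens, i)][boundaries[i]] += 1
    return counts

def predict(counts, tokens, i):
    """Predict a boundary if it was the majority outcome in training."""
    no, yes = counts.get(features(tokens, i), [1, 0])
    return int(yes > no)

# Tiny demo: a comma after a noun was followed by a boundary in training.
toy = [([{"pos": "NN", "punct_after": ","},
         {"pos": "VB", "punct_after": ""}],
        [1, 0])]
model = train(toy)
print(predict(model, toy[0][0], 0))  # boundary predicted after the comma
```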
Spoken NLG
• Decisions in Text-to-Speech (TTS) depend on syntax, information status, topic structure, etc., information that is explicitly available to NLG
• Concept-to-Speech (CTS) systems should be able to specify “better” prosody: the system knows what it wants to say and can specify how
• But … generating prosody for CTS isn’t so easy
Overview
• Spoken NLG in Dialogue Systems
• Text-to-Speech (TTS) vs. Concept-to-Speech (CTS)
• Current approaches to CTS
 – Hand-built systems
 – Corpus-based systems
• NLG evaluation
• Open questions
Relying upon Prior Research
• MIMIC CTS (Nakatani & Chu-Carroll ’00)
 – Use domain attribute/value distinction to drive phrasing and accent: critical information focused
   Movie: October Sky
   Theatre: Hoboken Theatre
   Town: Hoboken
   • Attribute names and values always accented
   • Values set off by phrase boundaries
 – Information status conveyed by varying accent type (Pierrehumbert & Hirschberg ’90)
   • Old (given): L*
   • Inferrable (by MIMIC, e.g. theatre name from town): L*+H
   • Key (to formulating a valid query): L+H*
   • New: H*
 – Marking Dialogue Acts
   • NotifyFailure:
     U: Where is “The Corrupter” playing in Cranford?
     S: “The Corrupter” [L+H*] is not [L+H*] playing in Cranford [L*+H].
   • Other rules for logical connectives, clarification and confirmation subdialogues
• Contrastive accent for semantic parallelism (Rooth ’92, Pulman ’97), used in GoalGetter and OVIS (Theune ’99)
   The cat eats fish. The dog eats meat.
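MIMIC’s information-status-to-accent mapping can be sketched as a simple lookup. The accent inventory is from the slides (following Pierrehumbert & Hirschberg ’90), but the rule table and function below are an invented paraphrase, not MIMIC’s actual implementation.

```python
# Illustrative sketch of accent assignment by information status.
# The mapping mirrors the slide; the code itself is hypothetical.
ACCENT_BY_STATUS = {
    "given": "L*",         # old information
    "inferrable": "L*+H",  # e.g. theatre name inferrable from town
    "key": "L+H*",         # key to formulating a valid query
    "new": "H*",           # new information
}

def annotate(words_with_status):
    """Attach an accent label to each (word, status) pair; None = no accent."""
    return " ".join(f"{w} [{ACCENT_BY_STATUS[s]}]" if s else w
                    for w, s in words_with_status)

print(annotate([("October", "new"), ("Sky", "new"),
                ("is", None), ("playing", None),
                ("in", None), ("Hoboken", "given")]))
# prints: October [H*] Sky [H*] is playing in Hoboken [L*]
```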
But … many counterexamples
• Association of prosody with many syntactic, semantic, and pragmatic concepts still an open question
• Prosody generation from (past) observed regularities and assumptions:
 – Information can be ‘chunked’ usefully by phrasing for easier user understanding
   • But in many different ways
 – Information status can be conveyed by accent:
   • Contrastive information is accented?
     S: You want to go to L+H* Nijmegen, L+H* not Eindhoven.
 – Given information is deaccented?
   Speaker/hearer givenness:
   U: I want to go to Nijmegen.
   S: You want to go to H* Nijmegen?
 – Intonational contours can convey speech acts, speaker beliefs:
   • Continuation rise can maintain the floor?
     S: I am going to get you the train information [L-H%].
 – Backchanneling can be produced appropriately?
   S: Okay. Okay? Okaaay… Mhmm…
 – Wh- and yes-no questions can be signaled appropriately?
   S: Where do you want to go.
   S: What is your passport number?
 – Discourse/topic structure can be signaled by varying pitch range, pausal duration, rate?
Overview
• Spoken NLG in Dialogue Systems
• Text-to-Speech (TTS) vs. Concept-to-Speech (CTS)
• Current Approaches to CTS
 – Hand-built systems
 – Corpus-based systems
• NLG Evaluation
• Open Questions
MAGIC
• Multimedia system for presenting cardiac patient data
 – Developed at Columbia by McKeown and colleagues, in conjunction with Columbia Presbyterian Medical Center, to automate post-operative status reporting for bypass patients
 – Uses mostly traditional, hand-developed NLG components
 – Generates text, then annotates it prosodically
 – Corpus-trained prosodic assignment component
• Corpus: written and oral patient reports
 – 50 min multi-speaker spontaneous + 11 min single-speaker read speech
 – 1.24M-word text corpus of discharge summaries
 – Transcribed, ToBI labeled
 – Generator features labeled/extracted:
   • syntactic function
   • part of speech
   • semantic category
   • semantic ‘informativeness’ (rarity in corpus)
   • semantic constituent boundary location and length
   • salience
   • given/new
   • focus
   • theme/rheme
   • ‘importance’
   • ‘unexpectedness’
 – Very hard to label features
• Results: new features to specify TTS prosody
 – Of CTS-specific features, only semantic informativeness (likelihood of occurring in a corpus) useful so far (Pan & McKeown ’99)
 – Looking at context and word collocation for accent placement helps predict accent (Pan & Hirschberg ’00)
   RED CELL (less predictable) vs. BLOOD cell (more)
   Most predictable words are accented less frequently (40-46%); least predictable, more (73-80%)
   Unigram+bigram model predicts accent status with 77% (+/- 0.51) accuracy
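The collocation idea above can be sketched as follows: estimate each word’s predictability in context with an interpolated unigram+bigram model, and accent the less predictable words. The smoothing, interpolation weights, and threshold are invented for illustration, not Pan & Hirschberg’s actual model.

```python
# Sketch of predictability-based accent prediction: less predictable
# words in context are more likely to be accented. Toy model only.
import math
from collections import Counter

def train_lm(sentences):
    """Collect unigram and bigram counts (with a <s> start symbol)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def log_prob(unigrams, bigrams, prev, word):
    """Interpolated unigram+bigram log probability (add-one smoothed)."""
    total = sum(unigrams.values())
    p_uni = (unigrams[word] + 1) / (total + len(unigrams))
    p_bi = (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))
    return math.log(0.5 * p_uni + 0.5 * p_bi)

def predict_accent(unigrams, bigrams, prev, word, threshold=-3.0):
    """Accent the word if it is sufficiently unpredictable in context."""
    return log_prob(unigrams, bigrams, prev, word) < threshold
```

With a toy corpus in which "blood cell" is frequent and "red cell" rare, the model accents the rare "red" but not the predictable "blood".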
Stochastic, Corpus-based NLG
• Generate from a corpus rather than a hand-built system
 – For an MT task, Langkilde & Knight ’98 over-generate from a traditional hand-built grammar
 – Output composed into a lattice
 – Linear (bigram) language model chooses the best path
• But …
 – No guarantee of grammaticality
 – How to evaluate/improve results?
 – How to incorporate prosody into this kind of generation model?
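The "bigram LM chooses the best lattice path" step can be sketched as a small Viterbi-style search. The lattice encoding, search code, and probabilities below are toy illustrations, not Langkilde & Knight’s system.

```python
# Toy sketch of lattice rescoring with a bigram language model:
# pick the word sequence with the highest total log probability.

def best_path(lattice, bigram_logprob):
    """lattice: dict node -> list of (word, next_node);
    node 0 is the start, node None is final."""
    best = {}  # (next_node, word) -> best score seen so far
    stack = [(0, "<s>", 0.0, [])]
    results = []
    while stack:
        node, prev, score, seq = stack.pop()
        if node is None:
            results.append((score, seq))
            continue
        for word, nxt in lattice[node]:
            s = score + bigram_logprob(prev, word)
            key = (nxt, word)
            if key in best and best[key] >= s:
                continue  # a better path already reached this state
            best[key] = s
            stack.append((nxt, word, s, seq + [word]))
    return max(results)[1]

# Toy lattice: "the underground trade" vs. "the trade underground".
lattice = {
    0: [("the", 1)],
    1: [("underground", 2), ("trade", 3)],
    2: [("trade", None)],
    3: [("underground", None)],
}
probs = {("<s>", "the"): -1.0, ("the", "underground"): -2.0,
         ("the", "trade"): -3.0, ("underground", "trade"): -1.0,
         ("trade", "underground"): -4.0}
print(best_path(lattice, lambda p, w: probs.get((p, w), -10.0)))
# prints: ['the', 'underground', 'trade']
```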
FERGUS (Bangalore & Rambow ’00)
• Corpus-based learning to refine syntactic, lexical and prosodic choice
• Domain is the DARPA Communicator task (air travel information)
• Uses a stochastic tree model + linear LM + XTAG (hand-crafted) grammar
• Trained on WSJ dependency trees tagged with p.o.s., morphological information, syntactic SuperTags (grammatical function, subcat frame, arg realization), WordNet sense tags and prosodic labels (accent and boundary)
• Input:
 – Dependency tree of lexemes
 – Any feature can be specified, e.g. syntactic, prosodic
   (Tree diagram: root “control” with daughters “poachers <L+H*>”, “now”, and “trade”; “trade” with daughters “the” and “underground”.)
• Tree Chooser:
 – Selects syntactic/prosodic properties for input nodes based on match with features of mothers and daughters in the corpus
   (Same dependency tree, with the chosen properties attached to its nodes.)
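A Tree Chooser of this kind can be sketched as a frequency lookup over (word, mother) pairs from the training trees. This is a deliberately simplified illustration; the treebank, labels, and matching scheme below are invented, not FERGUS’s actual model.

```python
# Toy sketch of a tree-chooser step: for each node of an input
# dependency tree, pick the property (here a made-up supertag label)
# most often observed in training for that word under that mother.
from collections import Counter, defaultdict

def train_chooser(treebank):
    """Count (word, mother_word) -> supertag occurrences."""
    counts = defaultdict(Counter)
    for tree in treebank:
        for word, mother, supertag in tree:
            counts[(word, mother)][supertag] += 1
    return counts

def choose(counts, word, mother, default="NP"):
    """Most frequent supertag for this word/mother pair, else a default."""
    c = counts.get((word, mother))
    return c.most_common(1)[0][0] if c else default

# Invented mini-treebank of (word, mother, supertag) triples.
treebank = [
    [("poachers", "control", "subj-NP"), ("trade", "control", "obj-NP")],
    [("trade", "control", "obj-NP")],
    [("trade", "control", "verb")],
]
counts = train_chooser(treebank)
print(choose(counts, "trade", "control"))  # most frequent label wins
```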
• Unraveler:
 – Produces a lattice of all syntactically possible linearizations of the tree, using the XTAG grammar
   (Lattice diagram over “poachers”, “now”, “control”, “the”, “underground”, “trade”, allowing e.g. both “now poachers …” and “poachers now …” orderings.)
• Linear Precedence Chooser:
 – Finds the most likely lattice traversal, using a trigram language model
   Now [H*] poachers [L+H*] [L-] control the underground trade [H*] [L-L%].
• Many ways to implement each step
 – How to choose which works ‘best’?
 – How to evaluate output?
Overview
• Spoken NLG in Dialogue Systems
• Text-to-Speech (TTS) vs. Concept-to-Speech (CTS)
• Current Approaches to CTS
 – Hand-built systems
 – Corpus-based systems
• NLG Evaluation
• Open Questions
Evaluating NLG
• How to judge success/progress in NLG is an open question
 – Qualitative measures: preference
 – Quantitative measures:
   • Task performance measures: speed, accuracy
   • Automatic comparison to a reference corpus (e.g. string edit-distance and variants, tree-similarity-based metrics)
 – Not always a single “best” solution
• Critical for stochastic systems to combine qualitative judgments with quantitative measures (Walker et al. ’97)
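One of the automatic metrics mentioned above, word-level string edit distance against a reference, can be sketched directly. This is the standard Levenshtein algorithm, not any specific evaluation toolkit.

```python
# Word-level Levenshtein edit distance between a generated hypothesis
# and a reference sentence (a minimal, standard implementation).
def edit_distance(hyp, ref):
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance("now poachers control the trade".split(),
                    "poachers now control the underground trade".split()))
# prints: 3
```

Note that this metric penalizes the reordered "now poachers" just as heavily as a real error, which is one reason string metrics correlate poorly with human judgments (next slide).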
Qualitative Validation of Quantitative Metrics
• Subjects judged understandability and quality
 – Candidates proposed by 4 evaluation metrics to minimize distance from the Gold Standard (Bangalore, Rambow & Whittaker ’00)
 – Tree-based metrics correlate significantly with understandability and quality judgments; string metrics do not
 – New objective metrics learned:
   • Understandability accuracy = (1.31 * simple tree accuracy - 0.10 * substitutions - 0.44) / 0.87
   • Quality accuracy = (1.02 * simple tree accuracy - 0.08 * substitutions - 0.35) / 0.67
Overview
• Spoken NLG in Dialogue Systems
• Text-to-Speech (TTS) vs. Concept-to-Speech (CTS)
• Current Approaches to CTS
 – Hand-built systems
 – Corpus-based systems
• NLG Evaluation
• Open Questions
More Open Questions for Spoken NLG
• How much to model the human original?
• Planning for appropriate intonational variation is important even in recorded prompts
• Timing and backchanneling
• What kind of output is most comprehensible?
• What kind of output elicits the most easily understood user response? (Gustafson et al. ’97, Clark & Brennan ’99)
• Implementing variations in dialogue strategy
 – Implicit confirmation
 – Mixed initiative