Speech-to-Speech Translation with Clarifications

Julia Hirschberg, Svetlana Stoyanchev
Columbia University
September 18, 2013

Outline

• Main Problem
• Key Ideas
• Solution Details
• Impact
• Issues, Gaps, and Future Work

Speech Translation

• Speech-to-Speech translation system
  [Diagram: the L1 speaker's spoken question (L1) passes through the Translation System to the L2 speaker as a translated question (L2); the L2 speaker's answer (L2) is translated back to the L1 speaker as a translated answer (L1).]

Speech Translation

• Translation may be impaired by:
  o Speech recognition errors
    - Word error rate on the English side of Transtac is 9%
    - Word error rate in the Let's Go bus information system is 50%
  o A speaker may use ambiguous language
  o A speech recognition error may be caused by use of out-of-vocabulary (OOV) words

Speech Translation

• Speech-to-Speech translation system
• Introduce a clarification component
  [Diagram: as above, but a Dialogue Manager on each side can hold a clarification sub-dialogue with its speaker before the speech question (L1) or answer (L2) is passed through the Translation System.]

Key Ideas

• Use targeted clarifications
• Address challenges with targeted clarifications
• Data collection for system evaluation

Most Common Clarification Strategies in Dialogue Systems

• "Please repeat"
• "Please rephrase"
• System repeats the previous question

What Clarification Questions Do Human Speakers Ask?

• Targeted reprise questions (M. Purver)
  o Ask a targeted question about the part of an utterance that was misheard or misunderstood, including understood portions of the utterance
  o Speaker: Do you have anything other than these XXX plans?
  o Non-reprise: What did you say? / Please repeat.
  o Reprise: What kind of plans?
• 88% of human clarification questions are reprise
• 12% are non-reprise
• Goal: introduce targeted (reprise) questions into a spoken system

Advantages of Targeted Clarifications

• More natural
• The user does not have to repeat the whole utterance/command
• Provides grounding and implicit confirmation
• Useful in systems that handle natural-language user responses, commands, or queries over a wide range of topics and vocabulary:
  o Speech-to-speech translation
  o Tutoring systems
  o Virtual assistants (in car, in home): a user command may contain ASR errors due to noise, background speech, etc.

Types of Clarification Questions in the TBOLT System

• Rephrase-part
  o Used when the error is OOV and NOT a name (works on difficult non-OOV words as well)
  o Asks the user to rephrase the error segment
  o "I did not understand when you said: fiscal. Please give me another word or phrase for it."
• Spelling
  o Used for names
  o "Please spell 'Rockefeller'."
• Disambiguation
  o Used to disambiguate between homophones
  o "Did you mean plain as in an extensive tract of level open land, or plane as in an aircraft?"

Types of Questions (cont.)

• Reprise (as found in human-human communication)
  o Repeats the part of the utterance before the error segment
  o User: We will search some of the XXX to make sure everyone is safe.
  o System: We will search some of the what?
• Reprise/Rephrase-part
  o Combines a targeted question with a rephrase question
  o System: We will search some of the what? Please say another word or phrase for this: 'vehicles'.
• Confirmation
  o A yes/no question to confirm an utterance
  o "Did you say 'the breach is located here'?"

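As a rough sketch (not the actual TBOLT code), each of the question types above can be realized as a simple string template over the detected error segment and its surrounding context. All function and parameter names here are illustrative assumptions:

```python
# Illustrative templates for the clarification question types above.
# Not the TBOLT implementation; all names are assumptions.

def rephrase_part(error_text: str) -> str:
    return (f"I did not understand when you said: {error_text}. "
            "Please give me another word or phrase for it.")

def spelling(name: str) -> str:
    return f"Please spell '{name}'."

def disambiguation(word_a: str, sense_a: str, word_b: str, sense_b: str) -> str:
    return f"Did you mean {word_a} as in {sense_a}, or {word_b} as in {sense_b}?"

def reprise(context_before_error: str) -> str:
    return f"{context_before_error} what?"

def reprise_rephrase(context_before_error: str, error_text: str) -> str:
    return (f"{context_before_error} what? "
            f"Please say another word or phrase for this: '{error_text}'.")

def confirmation(utterance: str) -> str:
    return f"Did you say '{utterance}'?"

# e.g. reprise("We will search some of the")
#   -> "We will search some of the what?"
```
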
Requirement for a Targeted Question

• Error detection (see the sketch below):
  o Error segment boundaries
  o Error type
  o Does the error contain a proper name?
  o Does the error contain an out-of-vocabulary (OOV) word?

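A hypothetical shape for the error detector's output, and one way its fields could select among the question types above. The slides specify what information is required, not a selection policy, so the record layout and priority order below are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class ErrorSegment:
    start: int                   # index of the first word in the error segment
    end: int                     # index one past the last word in the segment
    contains_name: bool = False  # does the segment contain a proper name?
    contains_oov: bool = False   # does the segment contain an OOV word?
    homophones: list = field(default_factory=list)  # candidate homophones, if any

def choose_question_type(seg: ErrorSegment) -> str:
    # Assumed priority order; the slides list the types, not the policy.
    if seg.contains_name:
        return "spelling"
    if seg.homophones:
        return "disambiguation"
    if seg.contains_oov:
        return "rephrase_part"
    return "reprise"
```
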
Challenges with Using Targeted Clarifications

1) A system's clarification question may not be appropriate because:
   • The error segment may not be detected correctly
   • The error type may not be identified correctly
2) How should a system construct a targeted/reprise clarification question?
3) How long should the system persist in its clarification?

Our goal is to address these challenges empirically.

Examples of Inappropriate Clarification Questions

• Rephrase-part (incorrect error type detected)
  o User: Captain Bell wants to ask you some questions.
  o Error detection: Captain <OOV> to ask you some questions.
  o System: I did not understand when you said "Bell wants". Please give me another word or phrase for it.
• Reprise/Targeted rephrase (error segment identified incorrectly)
  o User: The insurgents are denouncing us.
  o ASR/Error detection: The insurgents are doing now and <ERROR>
  o System: The insurgents are doing now and what? Please say another word or phrase for "cing us".

Examples (cont.)

• Spelling (error type incorrectly detected as a name)
  o User: It sounds like there is a kernel of truth to your story.
  o Error detection: It sounds like there is a <NAME> of truth to your story.
  o System: I think I heard a name when you said this: "kernel". Please spell that name.

Examples (cont.)

• Disambiguation (the word to be disambiguated is misrecognized)
  o User: How often do you have problems with generators?
  o ASR/Error detection: How often do you have problems with general?
  o System: Did you mean general as in broad or general as in a military officer?

Causes of Inappropriate Questions

• Rephrase-part
  o A partial word is detected as an error
  o The detected segment contains a name
  o The detected segment is a function word (to, from, the, …)
• Disambiguate
  o Neither choice offered for disambiguation is correct
  o The word to be disambiguated is misrecognized
• Spell
  o The detected segment is not a name
  o The detected segment is long
• Reprise
  o The reprise contains an undetected recognition error

Goal

• Develop a method to automatically identify when an inappropriate question has been asked
• Use the user's answers to detect whether a question was inappropriate

Data Collection

• Built a simulated clarification system
• Users were asked to read a sentence and then were played a pre-recorded question
• Users were led to believe they were interacting with the actual system

Data Collection (cont.)

• Prepared 228 questions:
  o 84 appropriate
  o 144 inappropriate
  o For each type of clarification question, created appropriate and inappropriate questions
  o 19 categories of clarification questions in total
• Each subject was asked 144 questions
• Recorded subjects' initial utterances and their answers to the questions

User Responses

• Subjects tended to be cooperative
• Answers varied from subject to subject
• Example: "I did not understand when you said: 'Betirma'. Please give me another word or phrase for it."
  o "No"
  o "Betirma"
  o "Betirma bravo echo tango india romeo mike alpha"

User Responses (cont.)

• Example 2:
  o User: "How often do you have problems with generators?"
  o System: "Did you mean general as in broad or general as in a military officer?"
  o Observed answers:
    - "generator as in a machine for making electricity"
    - "no"
    - "generators"

Method

• Extract lexical and prosodic features from responses:
  o Number of pauses, speech energy, speech tempo
  o Lexical and prosodic difference between the initial utterance and the answer to the clarification
• Measure the number of times subjects replay each question
• Measure latency: the length of the pause before the answer
• Determine whether questions are appropriate or inappropriate based on user responses (a classifier sketch follows below)

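A minimal sketch of such a detector, assuming the features above have already been extracted for each answer and are fed to an off-the-shelf scikit-learn classifier. The feature names and the choice of logistic regression are assumptions, not the authors' model:

```python
# Sketch: classify whether a user's answer followed an appropriate or an
# inappropriate clarification question. Feature names are assumptions;
# the slides list the feature families but not an exact model.
from sklearn.linear_model import LogisticRegression

def featurize(answer: dict) -> list:
    """answer: per-response measurements, extracted upstream."""
    return [
        answer["num_pauses"],       # pauses within the answer
        answer["energy"],           # mean speech energy
        answer["tempo"],            # speaking rate
        answer["latency"],          # pause length before answering
        answer["num_replays"],      # times the question was replayed
        answer["lexical_overlap"],  # word overlap with the initial utterance
    ]

def train_detector(answers: list, labels: list) -> LogisticRegression:
    # labels: 1 = the preceding question was inappropriate, 0 = appropriate
    X = [featurize(a) for a in answers]
    return LogisticRegression().fit(X, labels)
```
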
Challenge 2: Constructing Targeted Clarification Questions

• Previous work: collected clarification questions using Mechanical Turk (Stoyanchev et al. 2012, 2013)
• Using the human-generated questions, manually created a set of generation rules
• Evaluated the generated questions with human subjects

Types of Questions

• R_GEN Generic: <context before error> what?
  o Applies if no other rules apply
  o Sentence: The doctor will most likely prescribe XXX
  o Question: The doctor will most likely prescribe WHAT?
• R_SYN Syntactic: <context before error> what <context after error>?
  o Applies when there is a VB after the error, and the VB and the error share a parent
  o Sentence: When was the XXX contacted?
  o Question: When was WHAT contacted?
• R_NMOD: which <parent word>?
  o Applies when the error's dependency tag is NMOD and its parent's POS is NN | NNS
  o Sentence: Do you have anything other than these XXX plans?
  o Question: Which plans?
• R_START: what about <context after error>?

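A sketch of applying these rules in priority order, assuming a dependency parse that provides POS tags, dependency labels, and parent indices. The data layout, the single-token error segment, and the exact rule ordering are illustrative assumptions:

```python
# Illustrative rule-based reprise question generation. The rules come from
# the slide above; this code, its data layout, and the rule order are
# assumptions. `tokens` is a list of dicts {"word", "pos", "dep", "parent"}
# produced by a dependency parser; `err` is the index of the error token.

def generate_question(tokens: list, err: int) -> str:
    before = " ".join(t["word"] for t in tokens[:err])
    after = " ".join(t["word"] for t in tokens[err + 1:])
    e = tokens[err]

    # R_NMOD: the error modifies a noun -> "which <parent word>?"
    if e["dep"] == "NMOD" and tokens[e["parent"]]["pos"] in ("NN", "NNS"):
        return f"Which {tokens[e['parent']]['word']}?"

    # R_SYN: a verb follows the error and shares its parent ->
    # "<context before error> what <context after error>?"
    if any(t["pos"].startswith("VB") and t["parent"] == e["parent"]
           for t in tokens[err + 1:]):
        return f"{before} WHAT {after}?"

    # R_START: the error opens the sentence -> "what about <context after>?"
    if err == 0:
        return f"What about {after}?"

    # R_GEN: generic fallback -> "<context before error> what?"
    return f"{before} WHAT?"
```
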
Evaluation Questionnaire

• Generated questions automatically using the rules for a set of 84 sentences
• Asked humans (Mechanical Turk) to create clarification questions for the same sentences
• Applied the questionnaire to both human- and computer-generated questions

Subjects

• Mechanical Turk workers
• 6 subjects recruited from the lab
• Inter-annotator agreement

Results

Discussion

• R_GEN and R_SYN performance is comparable to human-generated questions
• R_NMOD ("which …?") outperforms all other question types, including human-generated questions
• The R_START rule did not work

Key Ideas

• Use targeted clarifications
• Address challenges with targeted clarifications:
  o Experiment on automatic detection of inappropriate questions
  o Experiment on automatic detection of when to terminate clarification
• Data collection for system evaluation

Image Description and Questioning

Show the user an image and ask them to describe it and to construct questions.

• Speaker 1:
  o A car is burning behind the girl
  o The girl looks startled
  o There was a massive explosion
• Speaker 2:
  o A woman is standing in front of a burning car
  o Everything around her seems to have been destroyed
  o What caused this destruction?

Data Collection for System Evaluation

• Advantages:
  o Does not prime users with words from a verbally described scenario
  o Elicits more natural speech than reading
  o Can be extended to a two-way dialogue where the interviewee is given narrative or video information for answering the interviewer's questions
• Disadvantages:
  o Uncontrolled vocabulary (users cannot be forced to mispronounce words)
  o No control across subject pairs

Impact

• Impact on speech-to-speech translation:
  o Detecting when a targeted clarification question was inappropriate is an important feature for determining the next dialogue move in clarification
• Impact beyond speech-to-speech translation:
  o Targeted clarifications can be used in spoken dialogue systems generally
  o They are especially useful for non-slot-filling systems (tutoring, virtual assistants)

Future Work

• Appropriate and inappropriate questions
  o Analyze the data collected in responses to appropriate and inappropriate clarification questions
  o Use machine learning to predict whether an utterance answers an appropriate or an inappropriate clarification question
• Targeted (reprise) clarification questions
  o Which information from the initial sentence should a reprise clarification question contain?
  o Using human-constructed questions, determine which information is essential to repeat in a targeted question
• Clarification length
  o How long should the system pursue a targeted clarification before backing off?
  o Collect data and use machine learning to predict, on each system turn, whether the clarification should continue or stop

Conclusions

• Used an error-simulation system to collect data:
  o Ran a data collection experiment for automatic detection of answers to 'inappropriate' system clarifications
  o Evaluation of automatically generated reprise clarification questions shows that they could be used in a system
  o Proposed an experiment for determining the optimal length of targeted clarification
• Collected audio data for system evaluation using an image description method

Thank you
Questions?
Challenge 3: Clarification Length

How long should the system focus on a targeted clarification before backing off?

• In speech-to-speech translation: back off = translate anyway
• In spoken dialogue systems: back off = ask a generic question such as "please rephrase"
• The answer depends on how patient and cooperative the users are.

Evaluation of Clarification Length

• BOLT 2012 system behavior: the system asks a targeted clarification at most 3 times before translating
• Goal: determine dynamically, at each clarification turn, whether the system should terminate the clarification process
• Use data to learn the dialogue strategy

Experiment Design

• Simulate a sequence of unsuccessful clarification questions
• Give the user the option to hit a "translate" button
• Distractor cases: simulate successful clarification
  o User: This computer is not operational
  o System: Please rephrase "not operational"
  o User: not working
  o System: thank you (translate and show the next question)
• Experimental cases:
  o Loop, asking 3-5 different targeted questions
  o The clarification dialogue continues until the user hits "translate"
• Use a combination of distractor and experimental cases

Method

• Use data to determine when the system should give up on a targeted clarification (a sketch follows below)
• Apply machine learning
• Features:
  o Dialogue length (the system is more likely to give up as the dialogue continues to fail)
  o Question type
  o Appropriateness of the clarification question
  o Confidences of the error detection and classification components

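A sketch of what a per-turn termination decision could look like, assuming a binary classifier trained on the features above; the feature encoding and the choice of model are assumptions, not the system's implementation:

```python
# Sketch: per-turn decision to continue or terminate clarification.
# The features mirror the list above; names and model are assumptions.
from sklearn.ensemble import RandomForestClassifier

def turn_features(state: dict) -> list:
    return [
        state["num_clarification_turns"],   # dialogue length so far
        state["question_type_id"],          # e.g. 0=reprise, 1=rephrase, ...
        state["question_was_appropriate"],  # predicted appropriateness (0/1)
        state["error_detection_conf"],      # error detector confidence
        state["error_type_conf"],           # error-type classifier confidence
    ]

def train_policy(states: list, labels: list) -> RandomForestClassifier:
    # labels: 1 = terminate clarification and back off (translate)
    X = [turn_features(s) for s in states]
    return RandomForestClassifier().fit(X, labels)

def should_terminate(model: RandomForestClassifier, state: dict) -> bool:
    return bool(model.predict([turn_features(state)])[0])
```
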