Speech-to-Speech Translation with Clarifications
Julia Hirschberg, Svetlana Stoyanchev
Columbia University
September 18, 2013

Outline
• Main Problem
• Key Ideas
• Solution Details
• Impact
• Issues, Gaps, and Future Work

Speech Translation
• Speech-to-speech translation system
• [Diagram: an L1 speaker's Speech Question (L1) passes through the Translation System and reaches the L2 speaker as a Translated Question (L2); the L2 speaker's Answer (L2) comes back as a Translated Answer (L1).]

Speech Translation
Translation may be impaired by:
• Speech recognition errors
  o Word error rate on the English side of Transtac is 9%
  o Word error rate in the Let's Go bus information system is 50%
• A speaker may use ambiguous language
• A speech recognition error may be caused by the use of out-of-vocabulary words

Speech Translation
• Speech-to-speech translation system
• Introduce a clarification component
• [Diagram: a Dialogue Manager on each side of the Translation System conducts a clarification sub-dialogue with its speaker before the translated question or answer is passed across.]

Key Ideas
• Use targeted clarifications
• Address challenges with targeted clarifications
• Data collection for system evaluation

Most Common Clarification Strategies in Dialogue Systems
• "Please repeat"
• "Please rephrase"
• System repeats the previous question

What Clarification Questions Do Human Speakers Ask?
• Targeted reprise questions (M. Purver)
  o Ask a targeted question about the part of an utterance that was misheard or misunderstood, including understood portions of the utterance
  o Speaker: Do you have anything other than these XXX plans?
  o Non-reprise: What did you say? / Please repeat.
  o Reprise: What kind of plans?
• 88% of human clarification questions are reprise; 12% are non-reprise
• Goal: introduce targeted (reprise) questions into a spoken system

Advantages of Targeted Clarifications
• More natural
• The user does not have to repeat the whole utterance/command
• Provides grounding and implicit confirmation
• Useful in:
  o Speech-to-speech translation
  o Systems that handle natural-language user responses/commands/queries and a wide range of topics and vocabulary: tutoring systems; virtual assistants (in car, in home), where a user command may contain ASR errors due to noise, background speech, etc.

Types of Clarification Questions in the TBOLT System
• Rephrase part
  o Used when an error is OOV and NOT a name (works on difficult non-OOV words as well)
  o Asks the user to rephrase the error segment
  o "I did not understand when you said: fiscal. Please give me another word or phrase for it."
• Spelling
  o Used for names
  o "Please spell 'Rockefeller'."
• Disambiguation
  o Used to disambiguate between homophones
  o "Did you mean plain as in extensive tract of level open land, or plane as in an aircraft?"

Types of Questions (cont.)
• Reprise (as found in human-human communication)
  o Repeats part of the utterance before the error segment
  o User: We will search some of the XXX to make sure everyone is safe.
  o System: We will search some of the what?
• Reprise/Rephrase-part
  o Combines a targeted question with a rephrase question
  o System: We will search some of the what? Please say another word or phrase for this: 'vehicles'.
• Confirmation
  o A yes/no question to confirm an utterance
  o "Did you say 'the breach is located here'?"

Requirements for a Targeted Question
• Error detection
• Error segment boundaries
• Error type
  o Does the error contain a proper name?
  o Does the error contain an out-of-vocabulary (OOV) word?
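The requirements above determine which clarification type the system can ask. Below is a minimal Python sketch of how error-detector output might drive that choice; ErrorSegment, choose_clarification, and the priority order of the checks are illustrative assumptions, not the TBOLT system's actual logic.

```python
from dataclasses import dataclass

@dataclass
class ErrorSegment:
    start: int            # index of the first misrecognized word
    end: int              # index just past the last misrecognized word
    is_name: bool         # does the segment contain a proper name?
    is_oov: bool          # does it contain an out-of-vocabulary word?
    homophones: list      # known homophone alternatives, if any

def choose_clarification(segment):
    """Map error-detection output to one of the question types above."""
    if segment is None:
        return "confirmation"      # no localized error: yes/no confirmation
    if segment.is_name:
        return "spelling"          # "Please spell 'Rockefeller'."
    if segment.homophones:
        return "disambiguation"    # "Did you mean plain ... or plane ...?"
    if segment.is_oov:
        return "rephrase_part"     # "... another word or phrase for it."
    return "reprise"               # "We will search some of the what?"
```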
Challenges with Using Targeted Clarifications
1) A system's clarification question may not be appropriate because:
  o An error segment may not be detected correctly
  o An error type may not be identified correctly
2) How should a system construct a targeted/reprise clarification question?
3) How long should the system persist in its clarification?
• Our goal is to address these challenges empirically

Examples of Inappropriate Clarification Questions
• Rephrase-part (incorrect error type detected)
  o User: Captain Bell wants to ask you some questions.
  o Error detection: Captain <OOV> to ask you some questions.
  o System: I did not understand when you said "Bell wants". Please give me another word or phrase for it.
• Reprise/Targeted rephrase (error segment is identified incorrectly)
  o User: The insurgents are denouncing us.
  o ASR/Error detection: The insurgents are doing now and <ERROR>
  o System: The insurgents are doing now and what? Please say another word or phrase for "cing us".

Examples (cont.)
• Spelling (error type is incorrectly detected as a name)
  o User: It sounds like there is a kernel of truth to your story.
  o Error detection: It sounds like there is a <NAME> of truth to your story.
  o System: I think I heard a name when you said this: "kernel". Please spell that name.

Examples (cont.)
• Disambiguation (disambiguated word is misunderstood)
  o User: How often do you have problems with generators?
  o ASR/Error detection: How often do you have problems with general?
  o System: Did you mean general as in broad or general as in a military officer?

Causes of Inappropriate Questions
• Rephrase part
  o Partial word is detected as an error
  o Detected segment contains a name
  o Detected segment is a function word (to, from, the, ...)
• Disambiguate
  o Neither choice for disambiguation is correct
  o Disambiguated word is misrecognized
• Spell
  o Detected segment is not a name
  o Detected segment is long
• Reprise
  o Contains an undetected recognition error

Goal
• Develop a method to automatically identify when an inappropriate question is asked
• Use the user's answers to detect whether a question was inappropriate

Data Collection
• Simulated clarification system
• Users were asked to read a sentence and then were played a pre-recorded question
• Subjects were led to believe they were interacting with the actual system

Data Collection (cont.)
• Prepared 228 questions: 84 appropriate and 144 inappropriate
• For each type of clarification question, created appropriate and inappropriate questions; 19 categories of clarification questions in total
• Each subject was asked 144 questions
• Recorded their initial utterances and their answers to the questions

User Responses
• Subjects tended to be cooperative
• Answers varied from subject to subject
• Example: "I did not understand when you said: 'Betirma'. Please give me another word or phrase for it."
  o "No"
  o "Betirma"
  o "Betirma bravo echo tango india romeo mike alpha"

User Responses (cont.)
• Example 2:
  o User: "How often do you have problems with generators?"
  o System: "Did you mean general as in broad or general as in a military officer?"
  o Answers: "generator as in a machine for making electricity"; "no"; "generators"

Method
• Extract lexical and prosodic features from responses
  o Number of pauses, speech energy, speech tempo
  o Lexical and prosodic difference between the initial response and the answer to the clarification
• Measure the number of times subjects replay each question
• Measure latency: the length of the pause before the answer
• Determine whether questions were appropriate or inappropriate based on user responses
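A rough sketch of the detector this method implies, assuming a scikit-learn setup. The feature names mirror the slide; train_appropriateness_detector and was_appropriate are hypothetical names, and the acoustic and lexical feature extraction is assumed to happen upstream.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per answer to a clarification question; columns mirror the
# features listed above.
FEATURE_NAMES = [
    "num_pauses",     # number of pauses in the answer
    "energy",         # speech energy
    "tempo",          # speech tempo
    "lexical_diff",   # lexical difference: initial utterance vs. answer
    "prosodic_diff",  # prosodic difference between the two turns
    "num_replays",    # times the subject replayed the question
    "latency",        # length of the pause before the answer began
]

def train_appropriateness_detector(X, y):
    """X: (n_answers, 7) feature matrix; y[i] = 1 if question i was appropriate."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf

def was_appropriate(clf, features):
    """Classify a single answer's feature vector."""
    return bool(clf.predict(np.array([features]))[0])
```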
Challenge 2: Constructing Targeted Clarification Questions
• Previous work: collected clarification questions using Amazon Mechanical Turk (Stoyanchev et al. 2012, 2013)
• Using the human-generated questions, manually created a set of generation rules
• Evaluated the generated questions with human subjects

Types of Questions
• R_GEN (generic): <context before error> what?
  o Applies if no other rules apply
  o Sentence: The doctor will most likely prescribe XXX
  o Question: The doctor will most likely prescribe WHAT?
• R_SYN (syntactic): <context before error> what <context after error>?
  o Applies when there is a VB after the error, and the VB and the error share a parent
  o Sentence: When was the XXX contacted?
  o Question: When was WHAT contacted?
• R_NMOD: which <parent word>?
  o Applies when the error's dependency tag is NMOD and its parent's POS is NN or NNS
  o Sentence: Do you have anything other than these XXX plans
  o Question: Which plans?
• R_START: what about <context after error>?
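The four rules form a natural cascade, most specific first, with R_GEN as the fallback. The sketch below is a toy reconstruction rather than the authors' generator: ErrorParse is a hypothetical container for the parser output the rules consult, the R_START trigger (error at the start of the sentence) is inferred from the rule's name, and the R_SYN template is inferred from the "When was WHAT contacted?" example.

```python
from dataclasses import dataclass

@dataclass
class ErrorParse:
    dep: str                 # dependency label of the error word, e.g. "NMOD"
    parent_pos: str          # POS tag of its parent, e.g. "NNS"
    parent_word: str         # the parent word itself, e.g. "plans"
    vb_after_error: bool     # is there a verb after the error?
    vb_shares_parent: bool   # do that verb and the error share a parent?

def generate_question(words, err_start, err_end, parse=None):
    """words: tokenized sentence; [err_start, err_end) is the error span."""
    before = " ".join(words[:err_start])
    after = " ".join(words[err_end:])

    # R_START: error opens the sentence -> "What about <context after error>?"
    if err_start == 0 and after:
        return f"What about {after}?"
    # R_NMOD: error modifies a noun -> "Which <parent word>?"
    if parse and parse.dep == "NMOD" and parse.parent_pos in ("NN", "NNS"):
        return f"Which {parse.parent_word}?"
    # R_SYN: a verb follows the error and shares its parent ->
    #        keep the context on both sides: "<before> WHAT <after>?"
    if parse and parse.vb_after_error and parse.vb_shares_parent:
        return f"{before} WHAT {after}?"
    # R_GEN: generic fallback -> "<context before error> what?"
    return f"{before} what?"

# e.g. generate_question(
#     "do you have anything other than these XXX plans".split(), 7, 8,
#     ErrorParse("NMOD", "NNS", "plans", False, False))
# -> "Which plans?"
```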
Evaluation Questionnaire
• Generated questions automatically using the rules for a set of 84 sentences
• Asked humans (on Mechanical Turk) to create clarification questions for the same sentences
• Applied the questionnaire to both human- and computer-generated questions

Subjects
• Mechanical Turk workers
• Recruited 6 subjects from the lab

Inter-annotator Agreement and Results
[Agreement figures and result tables from these slides are not preserved in the extracted text.]

Discussion
• R_GEN and R_SYN performance is comparable to human-generated questions
• R_NMOD ("which ...?") outperforms all other question types, including human-generated questions
• The R_START rule did not work

Key Ideas
• Use targeted clarifications
• Address challenges with targeted clarifications
  o Experiment on automatic detection of inappropriate questions
  o Experiment on automatic detection of when to terminate clarification
• Data collection for system evaluation

Image Description and Questioning
• Show the user an image and ask them to describe it and to construct questions
• Speaker 1:
  o A car is burning behind the girl
  o The girl looks startled
  o There was a massive explosion
• Speaker 2:
  o A woman is standing in front of a burning car
  o Everything around her seems to have been destroyed
  o What caused this destruction?

Data Collection for System Evaluation
• Advantages:
  o Does not prime users with words from a verbally described scenario
  o Elicits more natural speech than reading
  o Can be extended to a two-way dialogue where the interviewee is given narrative or video information for answering the interviewer's questions
• Disadvantages:
  o Uncontrolled vocabulary (cannot force subjects to mispronounce words)
  o No control across subject pairs

Impact
• Impact on speech-to-speech translation
  o Detecting when a targeted clarification question was inappropriate is an important feature for determining the next dialogue move in clarification
• Impact beyond speech-to-speech translation
  o Targeted clarifications can be used in spoken dialogue systems
  o Especially useful for non-slot-filling domains (tutoring, virtual assistants)

Future Work
• Appropriate and inappropriate questions
  o Analyze the data collected in responses to appropriate and inappropriate clarification questions
  o Use machine learning to predict whether an utterance is an answer to an appropriate or an inappropriate clarification question
• Targeted (reprise) clarification questions
  o Which information from an initial sentence should a reprise clarification question contain?
  o Using human-constructed questions, determine which information is essential to repeat in a targeted question
• Clarification length
  o How long should the system focus on a targeted clarification before backing off?
  o Collect data and use machine learning to predict, on each system turn, whether clarification should continue or stop

Conclusions
• Used an error-simulation system to collect data
  o Data-collection experiment for automatic detection of answers to 'inappropriate' system clarifications
• Evaluation of automatically generated reprise clarification questions shows that they could be used in a system
• Proposed an experiment for determining the optimal length of targeted clarification
• Collected audio data for system evaluation using an image-description method

Thank you
Questions?

Challenge 3: Clarification Length
• How long should the system focus on a targeted clarification before backing off?
  o In speech-to-speech translation: back off = translate
  o In spoken dialogue systems: back off = ask a generic question, e.g. 'please rephrase'
• The answer depends on how patient and cooperative users are

Evaluation of Clarification Length
• BOLT 2012 system behaviour: the system asks a targeted clarification at most 3 times before translating
• Goal: determine dynamically, at each clarification turn, whether the system should terminate the clarification process
• Use data to learn the dialogue strategy

Experiment Design
• Simulate a sequence of unsuccessful clarification questions
• Give the user the option to hit a "translate" button
• Distractor cases: simulate successful clarification
  o User: This computer is not operational
  o System: Please rephrase "not operational"
  o User: not working
  o System: thank you (translate and show the next question)
• Experimental cases: loop, asking 3-5 different targeted questions
  o The clarification dialogue continues until the user hits "translate"
• Use a combination of distractor and experimental cases

Method
• Use the data to determine when the system should give up on a targeted clarification
• Apply machine learning
• Features:
  o Dialogue length (more likely to give up as the dialogue continues to fail)
  o Question type
  o Appropriateness of the clarification question
  o Confidences of the error detection and classification components
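To make the feature list concrete, here is a minimal sketch, assuming a scikit-learn setup, of the termination policy this slide proposes; train_termination_policy and should_back_off are hypothetical names, and the feature encoding is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_termination_policy(X, y):
    """X rows: one clarification turn each; y: 1 = system should back off now."""
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X, y)
    return clf

def should_back_off(clf, turn_number, question_type_id,
                    question_appropriate, err_detect_conf, err_class_conf):
    """Decide at the current turn whether to stop clarifying and translate."""
    x = np.array([[turn_number,           # dialogue length so far
                   question_type_id,      # rephrase / spell / disambiguate / ...
                   question_appropriate,  # output of the appropriateness detector
                   err_detect_conf,       # error-detection confidence
                   err_class_conf]])      # error-classification confidence
    return bool(clf.predict(x)[0])
```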