CISC882: Natural Language Processing
Final Project
A Tale of Two Dialogue Models
By Sachin Kamboj
The Two Models
• Missing Axiom Theory Model
– Circuit Fix-It
• Ronnie W. Smith, D. Richard Hipp, and Alan W. Biermann: An architecture for voice dialog systems based on Prolog-style theorem proving. Computational Linguistics, Volume 21, Number 3, September 1995.
• Multimodal Speech and Gesture Models
• Oviatt et al.: Designing the user interface for multimodal speech and pen-based gesture applications: State of the art systems and future research directions. Human-Computer Interaction, 2000, vol. 15(4), pp. 263-322
Missing Axiom Theory Model
Circuit Fix-It
Aims/Target Behavior
• The system should be able to handle:
– Convergence to a goal
– Subdialogues and effective movement between them
– An accounting of user knowledge and abilities
– Change of initiative
– Expectation of user input
Circuit Fix-It
Sample Dialogs
Computer: Put the knob to one zero.
User: Which knob?
Computer: Near the top boundary and in the center is a green region. Now, in the middle right of this green region is the control knob.
User: Okay.
Computer: Put the knob to one zero.
User: Okay.
Computer: What is the voltage between connector one two one and connector three four?
User: I do not know.
Computer: Locate the voltmeter.
User: Done.
Computer: Add a wire between the minus com hole on the voltmeter and the connector one two one.
User: How do I do that?
Computer: Connect the end of the black wire with the small plug to the minus com hole on the voltmeter.
User: Done.
Computer: Connect the end of the black wire with the large plug to connector one two one.
Circuit Fix-It
Prolog Style Theorem Proving
• Based on the Missing Axiom Theory
– System is built around a theorem prover and the role of language
is to supply the missing axioms
– Goal is stated as a prolog axiom to be proven
– The system tries to prove the axiom in a top-down fashion.
• If the proof succeeds using internally available knowledge, the dialog
terminates without any interaction with the user.
• If the proof fails, the system tries to find the missing axiom by engaging the user in a dialog, e.g.:
observeposition(sw1,X) ← find(sw1), reportposition(sw1,X)
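A minimal Python sketch of this missing-axiom loop (the rule and fact names follow the slide's example; the dictionary-based knowledge base and the yes/no prompt are assumptions made for illustration): goals are proven top-down, and a goal that cannot be proven from internal knowledge is turned into a question to the user.

# Hypothetical sketch of the Missing Axiom Theory: a tiny top-down prover that,
# when a goal cannot be proven from internal knowledge, asks the user for it.

RULES = {
    # goal -> list of antecedent goals (Prolog-style "goal :- antecedents")
    "observeposition(sw1)": ["find(sw1)", "reportposition(sw1)"],
}
KNOWN = {"find(sw1)"}          # axioms the system already has

def prove(goal, ask=input):
    """Try to prove `goal` top-down; fall back to dialog for missing axioms."""
    if goal in KNOWN:                       # axiom already available: no dialog needed
        return True
    if goal in RULES:                       # general rule: prove each antecedent in turn
        return all(prove(g, ask) for g in RULES[goal])
    # Missing axiom: language supplies it by asking the user.
    answer = ask(f"Please {goal}. Done? (y/n) ")
    if answer.strip().lower().startswith("y"):
        KNOWN.add(goal)                     # the user has supplied the missing axiom
        return True
    return False

if __name__ == "__main__":
    print("Proof", "succeeded" if prove("observeposition(sw1)") else "failed")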
Circuit Fix-It
Implementing the Subdialog Feature
• One of the requirements of the system is to allow
subdialogs.
• Since the system engages in conversation only to prove missing axioms, each subdialog involves a separate proof.
– Hence the system cannot follow a simple depth-first policy to
complete a proof.
– Instead, to switch between subdialogs, the system should allow
the freezing of any proof and the transfer of control to a different
proof
• Partially completed proofs have to be maintained in memory.
• Freezing of proofs is handled through an Interruptible Prolog Simulator (IPSIM); a loose illustration of freezing and resuming proofs follows.
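IPSIM itself is a Prolog simulator described in the paper; as a loose analogy only, Python generators can illustrate the idea of freezing a partially completed proof and resuming it later. All names below are invented for illustration.

# Loose illustration (not IPSIM): each subdialog proof is a generator whose
# local state is frozen whenever it yields, so control can move to another
# proof and come back later.

def proof(name, steps):
    """A 'proof' that yields after each step, freezing its local state."""
    for step in steps:
        yield f"{name}: working on {step}"

# Two partially completed proofs kept in memory at once.
main_proof = proof("measure-voltage", ["locate voltmeter", "attach wires", "read value"])
sub_proof  = proof("locate-knob",     ["describe region", "confirm knob found"])

print(next(main_proof))   # the main proof starts ...
print(next(sub_proof))    # ... is frozen while a subdialog proof runs ...
print(next(sub_proof))
print(next(main_proof))   # ... then control returns to the frozen main proof.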
Circuit Fix-It
Accounting for User Knowledge
• The system should know what the user is capable of doing
– The requests should match the abilities of the user
– Abilities of different users will vary
• Novice users will know how to adjust a knob but may not know how to
take a voltmeter reading
– The system uses a user model to determine what can be
expected of the user.
– The user’s capabilities are specified in the form of Prolog-style rules, for example (a toy rendering follows below):
• If the input describes some physical state, then conclude that the user knows how to observe that physical state. In addition, if the physical state is a property, then infer that the user knows how to locate the object that has that property.
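A toy rendering of such a user-model rule in Python (the predicate strings and the set-based model are assumptions for illustration): a statement describing a physical state adds the ability to observe that state, and, when the state is a property of an object, the ability to locate that object.

# Hypothetical user-model rule from the slide: describing a physical state
# implies the user can observe it; if the state is a property of an object,
# it also implies the user can locate that object.

def update_user_model(user_model, physical_state, property_of=None):
    """Add inferred abilities to `user_model` (a set of capability strings)."""
    user_model.add(f"can_observe({physical_state})")
    if property_of is not None:                  # the state is a property of an object
        user_model.add(f"can_locate({property_of})")
    return user_model

model = set()
update_user_model(model, "switch_is_up", property_of="switch")
print(sorted(model))   # ['can_locate(switch)', 'can_observe(switch_is_up)']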
Circuit Fix-It
Mechanisms for Obtaining Variable Initiative
• Variable initiative plays a role in selecting the next subdialog to be entered.
• The system implements four levels of initiative:
– Directive Mode: unless the user needs clarification, the system
selects its response according to the next goal
– Suggestive Mode: the system will select its response depending
on the next goal but will allow interruptions to subdialogs about
related goals
– Declarative Mode: the user has dialog control, but the system is
free to mention relevant facts
– Passive Mode: The user has complete control. The system will
provide information only in direct response to the user’s
questions.
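One way to picture how the mode gates subdialog selection (the four mode names come from the slide; the selection policy itself is a simplifying assumption): in the more system-directed modes the system pursues its own next goal, while in the more user-directed modes it defers to the user's topic.

# Illustration only: how the initiative mode might gate which subdialog
# (system goal vs. user topic) is pursued next.

def next_subdialog(mode, system_goal, user_topic=None, related=False):
    if mode == "directive":
        return system_goal                                    # system keeps control
    if mode == "suggestive":
        # pursue the system goal, but accept interruptions about related goals
        return user_topic if (user_topic and related) else system_goal
    if mode == "declarative":
        return user_topic or system_goal                      # user controls the dialog
    if mode == "passive":
        return user_topic                                     # respond only to the user
    raise ValueError(f"unknown mode: {mode}")

print(next_subdialog("directive", "measure_voltage", "where_is_knob"))  # measure_voltage
print(next_subdialog("passive", "measure_voltage", "where_is_knob"))    # where_is_knob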
Circuit Fix-It
Implementation and Uses of Expectation
• If the computer produces an utterance that is an attempt to
have a specific task step S performed, there are
expectations for any of the following types of responses:
– A statement about the missing or uncertain background
knowledge necessary for the accomplishment of S
– A statement about a subgoal of S.
– A statement about the underlying purpose for S.
– A statement about ancestor task steps of which accomplishment
of S is a part
– A statement indicating the accomplishment of S.
• Expectations serve two purposes:
– Detecting and correcting errors
– Indicating shifts between subdialogs
Circuit Fix-It
Implementation and Uses of Expectation
• The system computes the expectations and the cost
of each expectation.
• The system also computes a set of “meanings” (or
semantic representations) of user utterances with a
corresponding cost
• The system combines the two costs:
C = βμ + (1 – β)E
where μ is the cost of the meaning, E is the cost of the expectation it matches, and β weights the two.
• The meaning with the smallest total cost is selected as the output of the parser (a worked toy example follows).
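A worked toy example of this combination (the candidate meanings, their costs, and the value of β are made up): each candidate meaning carries a parser cost μ and the cost E of the expectation it matches, and the candidate minimizing C = βμ + (1 – β)E is returned.

# Toy example of selecting the parser output by combined cost
# C = beta * mu + (1 - beta) * E   (mu = meaning cost, E = expectation cost).

BETA = 0.5
candidates = [
    # (meaning, mu = meaning cost, E = cost of the expectation it matches)
    ("user_confirms_step",   1.2, 0.3),
    ("user_asks_which_knob", 0.9, 1.5),
    ("user_reports_voltage", 2.0, 0.8),
]

def combined_cost(mu, expectation_cost, beta=BETA):
    return beta * mu + (1.0 - beta) * expectation_cost

best = min(candidates, key=lambda c: combined_cost(c[1], c[2]))
print(best[0])   # -> user_confirms_step (lowest combined cost: 0.75)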
Circuit Fix-It
Implementation and Uses of Expectation
• An important side effect of matching meanings
with expectations is the ability to interpret an
utterance whose content does not specify its
meaning.
– The reference of pronouns
• Turn the switch up
• Where is it?
– The meaning of short answers
• Turn the switch up
• Okay
– Maintaining dialog coherence
Basic Algorithm
ZmodSubdialog(Goal)
    Create subdialog data structures
    While there are rules available which may achieve Goal
        Grab next available rule R from knowledge; unify with Goal
        If R trivially satisfies Goal, return with success
        If R is vocalize(X) then
            Execute verbal output X (mode)
            Record expectation
            Receive response (mode)
            Record implicit and explicit meanings for response
            Transfer control depending on which expected response was received:
                Success response: return with success
                Negative response: no action
                Confused response: modify rule for clarification; prioritize for execution
                Interrupt: match response to expected response of another subdialog;
                           go to that subdialog (mode)
        If R is a general rule then
            Store its antecedents
            While there are more antecedents to process
                Grab the next one and enter ZmodSubdialog with it
                If the ZmodSubdialog exits with failure then terminate processing of R
            If all antecedents of R succeed, return with success
    Halt with failure
Multimodal Speech and Gesture
Interface Models
Introduction
• What are multimodal interfaces?
– Humans perceive the world through senses
• Ears (hearing), Eyes (sight), Nose (smell), Skin (touch) and
Tongue (taste)
• Communication through one sense is known as a mode
– Computers may process information through modes as
well
• Keyboards, microphones, mice, etc.
– Multimodal interfaces try to combine two or more different modes of communication.
• Slide borrowed from a talk on ‘Multimodal Interfaces’ by Joe Caloza
Advantages
• Combination of modalities allows more powerfully expressive and transparent information-seeking dialogues:
– Different modalities provide complementary capabilities
• Users prefer speech input for functions like describing objects and events
and for issuing commands
• Pen input is preferred for conveying symbols, signs and gestures and for
pointing and selecting visible objects
– Multimodal pen/voice interaction can result in 10% faster task completion, 36% fewer task-critical content errors, 50% fewer spontaneous disfluencies, and shorter, simpler linguistic constructions
– This corresponds to a 90-100% user preference for interacting multimodally
Advantages (2)
• Able to support superior error-handling compared
with unimodal recognition interfaces
– User-centric reasons
• Users will select the input mode that they judge to be less
error prone for a particular lexical context
• Users’ language is simplified when interacting multimodally
• Users have a strong tendency to switch modes after system
errors
– System-centric reasons
• A well-designed multimodal architecture can support mutual
disambiguation of input signals
Advantages (3)
• Allow users to exercise selection and control over
how they interact with the computer
• Hence can accommodate a broader range of users
– A visually impaired user may prefer speech input and
TTS output
– A user with a hearing impairment, strong accent, or a
cold may prefer pen input
• Multimodal interfaces are particularly suitable for
supporting mobile tasks such as communication
and personal navigation
Types of Multimodal Architecture
• Can be subdivided into two main types:
– Early Fusion
• Integrate signals at the feature level
• Based on Hidden Markov Models and Temporal Neural
Networks
• The recognition process in one mode influences the course of
recognition in the other
• Used for closely coupled and synchronized modalities (e.g. speech and lip movement)
– Systems tend not to apply or generalize as well if the modes differ
substantially in the information content or time scale characteristics
• Require a large amount of training data to build the system.
Types of Multimodal Architecture
– Late Fusion
• Integrate information at a semantic level
• Use individual recognizers trained using unimodal data
• Systems based on semantic fusion can be scaled up more easily, whether in the number of input modes or in the size and type of the vocabulary sets
• Require an architecture that supports fine-grained time
stamping of at least the beginning and end of each input
signal
– This is needed to determine whether two signals are part of a multimodal construction or whether they should be interpreted as unimodal commands (see the sketch below).
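A small sketch of this time-stamping requirement (the signal format and the one-second threshold are assumptions): given begin and end time stamps on each recognized signal, an integrator can decide whether a speech result and a gesture result belong to one multimodal construction or should be handled as separate unimodal commands.

# Illustration of why late (semantic) fusion needs fine-grained time stamps:
# decide whether a speech signal and a gesture signal form one multimodal
# construction, based on overlap or closeness of their time intervals.

from dataclasses import dataclass

@dataclass
class Signal:
    mode: str      # "speech" or "gesture"
    content: str   # recognizer output (semantic level, not raw features)
    start: float   # begin time stamp, seconds
    end: float     # end time stamp, seconds

def same_construction(a: Signal, b: Signal, max_gap: float = 1.0) -> bool:
    """True if the two signals overlap or fall within `max_gap` seconds of each other."""
    gap = max(a.start, b.start) - min(a.end, b.end)
    return gap <= max_gap

speech  = Signal("speech",  "put a medical unit here", start=2.0, end=3.5)
gesture = Signal("gesture", "point at (41.2, -73.1)",  start=3.8, end=4.0)

if same_construction(speech, gesture):
    print("fuse:", speech.content, "+", gesture.content)   # one multimodal command
else:
    print("treat as separate unimodal commands")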
Multimodal Architecture
[Architecture diagram: speech input passes through speech recognition and NLP; pen, glove, and laser input passes through gesture recognition and gesture understanding; both streams meet in multimodal integration, supported by context management, and feed the dialogue manager; output flows through response planning (graphics, VR, TTS) and through application invocation and coordination to the applications (App1, App2, App3).]
Applications
• OGI QuickSet System
– Enables a user to create and position entities on a map with
speech, pen-based gestures and direct manipulation.
• These entities are then used to initialize/run a simulation
• IBM’s Human-Centric Word Processor
– Combines Natural Language understanding with pen-based
pointing and selection gestures.
– Used to correct, manipulate and format text after it has been
entered
• Boeing’s Virtual Reality Aircraft Maintenance Training
Prototype
– Used for assessing the maintainability of new aircraft designs and for training mechanics in maintenance procedures using VR
Applications
• Meditor: Multimode Text Editor
– Combines a keyboard, a Braille terminal, a French text-to-speech synthesiser, and a speech recognition system.
– Allows blind users to perform simple document editing tasks.
• MATCH
– Multimodal Access to City Help
– A multimodal portable device, created by AT&T, that accepts speech and pen gestures
– Allows mobile users to access restaurant and subway
information for New York City
Conclusion
• Multimodal systems are useful for a wide variety
of applications
• They provide increased robustness, ease of use
and flexibility.
• They make computers accessible to a wider and more diverse range of users
• However, the area still requires substantial research, and many challenges remain to be overcome
Questions?