Representation and Reasoning in a Multimodal Conversational Character
Marc Cavazza
School of Computing and Mathematics, University of Teesside
Middlesbrough, TS1 3BA
United Kingdom
m.o.cavazza@tees.ac.uk
Abstract
We describe the reasoning mechanisms used in a
fully-implemented dialogue system. This dialogue system, based on a speech act formalism,
supports a multimodal conversational character
for Interactive Television. The system maintains
an explicit representation of programme descriptions, which also constitutes an attentional
structure. From the contents of this representation, it is possible to control various aspects of
the dialogue process, from speech act identification to the multimodal presentation of the interface.
1 Introduction
In this paper, we describe the practical reasoning
adopted in a fully-implemented human-computer dialogue system. This system is a conversational character
[Nagao and Takeuchi, 1994] [Beskow and McGlashan,
1997] [André et al., 1998] for interactive TV that assists
the user in his choice of a TV programme through an
Electronic Programme Guide (EPG). It is also a multimodal system, as human-computer dialogue is synchronised with the character’s non-verbal behaviour (i.e.,
facial expressions) and the display of background images
corresponding to the programme categories being discussed at a given point in dialogue (though only system
output is multimodal, input being through speech only).
This system is based on the co-operative construction of
a programme description from the expression of user
preferences. The programme description constitutes in
fact a representation of the current dialogue focus. This
is a consequence of the task model for this specific information search dialogue, which is one of incremental
search and construction of a programme description. In
the next sections, after giving a brief overview of the
system, we show that much of the practical reasoning can
be based on the programme description, which serves as an attentional structure [Wiebe et al., 1998]. This attentional structure is not a list of explicit entities but rather a semantic structure characterising the current focus. We
also describe the control of the multimodal interface and
how it can be based on the focus representation and dialogue history, which constitute the two main representations used by the dialogue system.
2 System Overview
The system is a mixed-initiative conversational interface
organised around a human character with which the user
communicates through speech recognition. The interface
uses the Microsoft Agent™ system with a set of animated bitmaps acquired from a real human subject.
The dialogue system is based on speech act theory [Austin, 1962] [Cohen and Perrault, 1979]. Each user utterance is interpreted in terms of the specific set of speech
acts defined for the system (see below). Speech act identification is based on the semantic content of the user
utterance. Once the speech act is identified, the programme description is updated accordingly and will
serve for further comparisons in the subsequent rounds
of dialogue. The system has been fully-implemented
with a vocabulary of 300+ words, including a few proper names (< 10%). Future versions will essentially extend
the vocabulary by increasing the number of proper names
for cast, programme names, etc. [Cavazza, 2000a].
Figure 1 illustrates the linguistic processing step behind
the system, namely the construction of a semantic representation from which the EPG is searched. This semantic
representation serves as a basis for the incremental construction of the attentional structure. Figure 2 shows the
user interface, which comprises the conversational character and background images illustrating the topics under
discussion (according to the dialogue focus).
Apart from the identification of speech acts, which is
based on a specific set of rules, reasoning takes place in
the system to decide on the following actions:
• system replies to the user and whether the system
should carry out a new programme guide search
• non-verbal behaviour of the character
• display of background images in connection with the
current dialogue status
• dialogue repair
In the next sections, we describe these various reasoning
procedures.
Figure 1. From Parsing to EPG Search. The user query "Is there a movie with John Wayne" is parsed, its semantic content ((:Request) ((:category) (:movie)) ((:cast) (:John_Wayne))) is assembled into a feature structure (Connotation: Entertaining; Preferences: Guidance = Family, Cast = John Wayne, Pay = nil; Categories: Top = Movie, Subcat_1 = Western, Subcat_2 = Classic; Selection: Rio_Bravo), and this feature structure is used to search the Electronic Programme Guide.
3 Reasoning from the Attentional Structure
3.1 Speech Act Identification
The user utterance is interpreted as a speech act [Traum
and Hinkelman, 1992] [Busemann et al., 1997]. The rationale for using speech acts is that they can categorise
user reaction to the current dialogue focus, from which it
is possible to generate an appropriate system response, in
terms of EPG search or user reply.
The illocutionary value of the user speech act is identified using the attentional structure. To this end, the semantic content of the current user utterance is compared with the semantic content of the attentional structure [Cavazza, 2000b]. Comparison of semantic features can identify the user's intentions, such as acceptance, rejection or specification. This form of identification is well suited to incremental dialogue, where new
criteria are progressively introduced. The use of content
and comparison of successive utterances for speech act
identification follows previous work by Walker [1996]
and Maier [1996].
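As a minimal sketch only, this comparison-based identification can be expressed as follows in Python; the function name, the dictionary representation of filters and the exact decision order are our own simplifying assumptions rather than the system's actual implementation.

# Hypothetical sketch: classify a user utterance by comparing its semantic
# filter with the attentional structure built from previous utterances.
# Filters are represented here as dicts such as {"SUB_GENRE": "COMEDY"}.
def identify_speech_act(new_filter, attentional, negated=False):
    if not new_filter:
        return "STANDBY"          # no new information in the utterance
    if negated:
        return "REJECT"           # explicit rejection ("not a western")
    for feature, value in new_filter.items():
        if feature in attentional and attentional[feature] != value:
            return "I-REJECT"     # contradicts a value already grounded in focus
    if not attentional:
        return "INITIAL"
    return "SPECIFY"              # new, compatible criterion refining the search

attentional = {"SUB_GENRE": "THRILLER"}
print(identify_speech_act({"SUB_GENRE": "COMEDY"}, attentional))   # -> I-REJECT
print(identify_speech_act({"CAST": "JOHN_WAYNE"}, attentional))    # -> SPECIFY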
Speech act identification determines the overall system response. This response comprises: i) searching the EPG, ii) updating the attentional structure, iii) replying to the user, and iv) when applicable, selecting a facial expression for the agent. A specific set of rules for updating the attentional structure is associated with each type of speech act. These rules maintain dialogue consistency by ensuring, for instance, that if a genre is rejected, its sub-genres no longer appear in the attentional structure. The selection of user replies and the generation of multimodal output are discussed in sections 3.2 and 4 respectively.
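The consistency rule just mentioned (rejecting a genre retracts its grounded sub-genres) can be sketched in the same style; the genre hierarchy and the dictionary layout below are invented for the example.

# Illustrative consistency rule, not the system's actual code: if a genre is
# rejected, its grounded sub-genre is also removed from the attentional structure.
GENRE_HIERARCHY = {
    "ENTERTAINMENT": {"COMEDY", "SITCOM"},
    "MOVIE": {"WESTERN", "THRILLER", "CLASSIC"},
}

def apply_genre_rejection(attentional, rejected_genre):
    if attentional.get("GENRE") == rejected_genre:
        del attentional["GENRE"]
    if attentional.get("SUB_GENRE") in GENRE_HIERARCHY.get(rejected_genre, set()):
        del attentional["SUB_GENRE"]
    return attentional

print(apply_genre_rejection({"GENRE": "MOVIE", "SUB_GENRE": "WESTERN"}, "MOVIE"))
# -> {} : rejecting the genre also retracts its grounded sub-genre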
The main speech acts are: specification of a sub-category
(“specify”), explicit rejection of a category (“reject”),
implicit rejection of a category (“i-reject”, for instance,
when the user has previously selected a western and asks
for a comedy: “can I have a comedy instead?”) and the
“another” speech act, which rejects the lowest-level
category it mentions. Finally, a “standby” speech act is
recognised when the user utterance does not provide new
information (not being an implicit confirmation either).
This serves to identify non-productive dialogue phases.
For instance, the following dialogue illustrates the recognition of an “implicit rejection” (I-reject) speech act.
The dialogue transcript format is based on the various
stages of processing. The user utterance (User:) is processed by the speech recognition system to produce system input (Recognised:). The input is parsed into a semantic structure (Semantics:). A Filter is generated from
this semantic structure to search the EPG (Filter:); the
unification of filters also constitutes the attentional
structure. Speech acts are identified by comparing the
current filter with the attentional structure. As the new sub_genre requested by the user ("comedy") contradicts the one previously grounded in the attentional structure ("thriller"), the system identifies an implicit rejection [Searle, 1975].
User: DO YOU HAVE ANY THRILLERS
Recognised: DO YOU HAVE ANY THRILLER
Semantics: ((QUESTION) (EXIST) (PROGRAMME ((SUB_GENRE THRILLER) (INDET))))
Filter: ((SUB_GENRE THRILLER))
Speech Act: (INITIAL (SUB_GENRE THRILLER) SEARCH)
System: I found 5 programmes corresponding to that selection. What about: "12 Monkeys"?

User: CAN I HAVE A COMEDY INSTEAD
Recognised: CAN I HAVE A COMEDY THERE ON
Semantics: (((QUESTION) (SUBJECT ((AUDIENCE USER))) (CHOICE+ ((VIEW))) (PROGRAMME ((SUB_GENRE COMEDY) (INDET)))))
Filter: ((SUB_GENRE COMEDY))
Speech Act: (I-REJECT SUB_GENRE COMEDY SEARCH)
3.2 Replying and Dialogue Repair
The believability of dialogue largely depends on the
relevance of system replies. These should correspond to
the perceived intentions of the user as identified through
the recognised speech acts. Here, global dialogue consistency is maintained mostly through local reasoning,
which is based on knowledge about the hierarchical organisation of categories in the EPG. This knowledge is
used i) to generate appropriate responses, ii) to determine if sufficient information is available for searching the EPG, and iii) to trigger dialogue repair when appropriate. For
instance, explicit rejections (“not a western”, “I don’t
like sitcoms”, “I don’t like James Woods”), are acknowledged by the system when proposing alternative choices:
“I have found this programme which is not a western:”.
This also confirms that the negative choice is grounded
in discourse [Traum and Allen, 1992]. This is reflected
internally by the construction of a “negative filter” containing explicitly rejected criteria, which remains valid
for the duration of the current dialogue. There exist various levels of attentional focus: some are explicit and represented in the system, while others are only implicit and, as such, not represented in the absence of previous grounding.
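A minimal sketch of such a negative filter is given below; the programme record format and field names are invented for the example. The transcript that follows then shows an explicit rejection being recognised.

# Illustration only: a "negative filter" accumulating explicitly rejected
# criteria for the duration of the dialogue, used to exclude EPG entries.
negative_filter = {("SUB_GENRE", "WESTERN"), ("CAST", "JAMES_WOODS")}

def acceptable(programme, rejected):
    """A programme is acceptable if it matches none of the rejected criteria."""
    return not any(programme.get(feature) == value for feature, value in rejected)

epg = [
    {"TITLE": "Rio Bravo", "SUB_GENRE": "WESTERN", "CAST": "JOHN_WAYNE"},
    {"TITLE": "12 Monkeys", "SUB_GENRE": "THRILLER", "CAST": "BRUCE_WILLIS"},
]
print([p["TITLE"] for p in epg if acceptable(p, negative_filter)])  # -> ['12 Monkeys']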
User: IS THERE ANYTHING FUNNY I CAN WATCH
Recognised: THE IS THERE ANYTHING FUNNY I CAN WATCH
Semantics: (((QUESTION) (EXIST) (PROGRAMME ((CHOICE+))) (DET)) ((GOOD) (SUB_GENRE COMEDY)) ((AUDIENCE USER)) ((QUESTION)) ((VIEW)))
Filter: ((SUB_GENRE COMEDY))
Speech Act: (INITIAL (SUB_GENRE COMEDY) SEARCH)
System: I found 4 programmes corresponding to that selection. What about: "Starting Out"?

User: NO I DON'T WANT A SITCOM
Recognised: AN OLD I DON'T WANT A SITCOM
Semantics: ((CHOICE+) (NEGATIVE) ((AUDIENCE USER)) (PROGRAMME ((SUB_GENRE COMEDY) (GENRE ENTERTAINMENT) (INDET))))
Filter: ((GENRE ENTERTAINMENT) (SUB_GENRE COMEDY) (NEGATIVE))
Speech Act: (REJECT GENRE NIL NO_SEARCH)
System: What would you like to see, then?
For instance, when proposing a programme on the basis
of a high-level category (movie genre for example), there
is implicit information, such as the movie cast, that can
be known to the user or appear on background images
while it has not been grounded in discourse and hence is
not part of the current attentional structure.
The introduction of any new category during dialogue
triggers an EPG search, in order to come up with a proposal as early as possible. This is no longer the case when the user rejects high-level categories such as genre and sub-genre. In the above dialogue, the user rejection ("I don't want a sitcom") rejects both the genre and sub-genre, which prompts the agent to take the initiative in order to refocus the dialogue ("what would you like, then?"). Another traditional form of repair takes place when the dialogue does not appear to progress. For instance, if the content of the user reply does not bring new information, the system returns the question to the user ("is this programme all right, then?"). Such a repair is used when the
dialogue loses cohesion: it takes advantage of the linearity of focus in that application [Hobbs, 1978]. The various dialogue repair procedures currently implemented
are mostly situational: they derive from the need to control dialogue progression. They are triggered by significant backtracking in dialogue (e.g., rejection of top-level
categories) or unproductive dialogue (such as several
utterances not contributing to programme description).
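As an illustration, these two situational triggers could be sketched as follows; the history format and the threshold on unproductive turns are assumptions made for the example, not the published system's values.

# Hypothetical sketch of the repair triggers: significant backtracking
# (rejection of a top-level category) or a run of unproductive utterances.
def needs_repair(history, max_standby=2):
    """history lists the recognised speech acts, most recent first,
    e.g. [("REJECT", "GENRE"), ("SPECIFY", "SUB_GENRE"), ...]."""
    if history and history[0][0] == "REJECT" and history[0][1] in ("GENRE", "SUB_GENRE"):
        return True                                   # backtracking on a top-level category
    recent = [act for act, _ in history[:max_standby]]
    return len(recent) == max_standby and all(act == "STANDBY" for act in recent)

print(needs_repair([("REJECT", "GENRE"), ("SPECIFY", "SUB_GENRE")]))  # -> True
print(needs_repair([("STANDBY", None), ("STANDBY", None)]))           # -> True
print(needs_repair([("SPECIFY", "SUB_GENRE")]))                       # -> False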
4 Controlling Multimodal Output
Our system is a multimodal presentation system displaying i) various "meaningful" facial expressions for the talking character, ii) background images corresponding to the topic(s) under discussion, and iii) text echoing the agent's synthesised speech.
The talking character has been developed using the Microsoft Agent™ package as a software architecture
[Trower, 2000]. This software provides an integration of
character animation and Text-To-Speech, including
automatic, though simplified, lipsync features (i.e., with
a limited number of mouth shapes).
To create the character, a human actor has been filmed
against a blue background and video data acquired by
chroma keying (Figure 2). The subject was instructed to
adopt various facial expressions (happy, unhappy, surprised, etc.) and to read aloud word sequences, so that
the set of mouth shapes required for lipsync could be
recorded. The data has been converted into bitmap sequences and incorporated into Microsoft Agent™’s animation routines. This has produced mouth shapes to fit
the lipsync facility, facial expressions to be displayed by
the talking head and idle animations to be played between dialogue turns.
The information presented to the user always emphasises the topic under discussion, i.e. the criterion being refined by the user, while also reminding the user of the categories already selected, as they appear in the attentional structure (through the still images in the character's background). Both contribute to keeping the presentation relevant, even in the case of partial understanding.
4.1 Background Images
The choice of background images depends on the level
of refinement reached by the dialogue. In that sense, they
reflect the current focus of discussion, prior to the interpretation of the latest speech act. There exist different
rules for displaying background images depending on the
level of refinement of the current search. At the topmost
level, when the user is discussing high-level categories,
such as programme genres, the system displays a random
selection of sample images corresponding to different
genres. After having selected a category (e.g. “could I
watch a movie tonight?”), the system can display a selection of the available subgenres. As the system is always offering a possible instance, early from the dialogue, in that case the specific instance would be part of
the selection. Once a specific sub-genre is discussed, the
system can display several instance programmes, again
including the one it might be suggesting to the user as a
first choice. One important aspect of background images
is that they may constitute suggestions as a complementary channel to speech. Spoken suggestions generally
take place at programme instance level, once a set of
well-identified programmes matches the user criteria. On
the other hand, graphic display constitutes an implicit suggestion. One example consists in hinting at sub-genres when a top-level category such as programme genre is under discussion. For instance, in the sample dialogue below, the first exchange refers to the genre category "sports". The interface, when replying to the user, can display in its background a sample of the
available sub-genres, e.g. football, cricket, Formula 1
racing (Figure 2). There is thus a difference in focus and
contents between the explicit representation (attentional
structure) and the background images, as these can contain information not grounded in discourse. However the
complementarity between the two channels eventually
enhance the expressivity of the interface.
Figure 2. The User Interface with Background Images (the character replies "I have 5 Programmes for that selection").
User: WHAT KIND OF SPORTS PROGRAMME DO YOU HAVE
Recognised: WHAT KIND OF SPORTS PROGRAMME YOU HAVE
Semantics: (((QUESTION) (EXIST) (PROGRAMME ((PROGRAMME) (GENRE SPORT) (INDET)))) ((V_SPEAKER)) ((VIEW)))
Filter: ((GENRE SPORT))
Speech Act: (INITIAL (GENRE SPORT) SEARCH)
System: I have 5 programmes for that selection. Would you like to watch "Keegan's Greatest Games"?

User: CAN I HAVE SOME CRICKET INSTEAD
Recognised: CAN I HAVE SAME CRICKET IS THERE
Semantics: (((QUESTION) (SUBJECT ((AUDIENCE USER))) (CHOICE+ ((VIEW)))) ((SUB_GENRE CRICKET)) ((QUESTION) (EXIST)))
Filter: ((GENRE SPORT) (SUB_GENRE CRICKET))
Speech Act: (SPECIFY SUB_GENRE CRICKET SEARCH)
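The display rules described in this section can be summarised in a short sketch; the image catalogue, file names and sampling policy are illustrative assumptions rather than the actual interface code.

# Hypothetical sketch of the display rules: the choice of background images
# depends on how far the programme description has been refined.
import random

SAMPLE_IMAGES = {
    None: ["movie.png", "sport.png", "news.png", "music.png"],        # top level: genres
    ("MOVIE", None): ["western.png", "thriller.png", "classic.png"],  # sub-genres of a genre
    ("SPORT", "CRICKET"): ["cricket_1.png", "cricket_2.png"],         # instances of a sub-genre
}

def background_images(attentional, suggested_instance=None, k=3):
    genre = attentional.get("GENRE")
    sub_genre = attentional.get("SUB_GENRE")
    if genre is None:
        images = random.sample(SAMPLE_IMAGES[None], k)    # random sample across genres
    else:
        images = list(SAMPLE_IMAGES.get((genre, sub_genre), []))[:k]
    if suggested_instance:
        images.append(suggested_instance)                 # always include the current proposal
    return images

print(background_images({"GENRE": "SPORT", "SUB_GENRE": "CRICKET"}, "sportstalk.png"))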
4.2 Non-Verbal Behaviour
Non-verbal behaviour constitutes an additional channel
through which the user can receive information. In addition, the non-verbal channel does not disrupt the overall
course of dialogue. In our system, non-verbal behaviour
is based on a small set of facial expressions that express
wonder, happiness or sadness (Figure 3).
The relations between facial expressions and dialogue
can be determined by local or global dialogue conditions.
Local conditions are those based on a single user utterance. For instance, a low speech recognition confidence
score, or an empty semantic structure can directly trigger
a perplexed facial expression for the agent. Similarly,
when the user is backtracking from previous choices and
the system cannot readily make a suggestion, the repair
prompt (“what kind of programme do you want, then?”)
is best accompanied by an appropriate facial expression,
to emphasise the situation. User acceptance of the suggested programme is welcomed with cheerful greetings.
Global conditions are related to the control of dialogue
progression: facial expressions can be used to send nonverbal signals giving feedback on the overall dialogue
progression, based on dialogue history (see below). For
instance, the character might look increasingly worried when the dialogue does not appear productive. However, in some cases a specific verbal intervention may be required.
The important aspect is to know when to insert it in the
dialogue; as some of these expressions might irritate the
user [Fischer and Batliner, 2000], non-verbal behaviour
might be a privileged mode of providing feedback to the
user.
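A possible sketch of this mapping from local and global dialogue conditions to facial expressions is given below; the confidence threshold and the exact expression labels are assumptions for illustration only.

# Illustrative mapping from dialogue conditions to facial expressions.
def choose_expression(recognition_score, semantics, speech_act, unproductive_turns):
    if recognition_score < 0.4 or not semantics:
        return "perplexed"        # local: poor recognition or empty semantic structure
    if speech_act == "ACCEPT":
        return "cheerful"         # local: the user accepted the suggested programme
    if speech_act in ("REJECT", "I-REJECT"):
        return "surprised"        # local: accompanies the refocusing prompt
    if unproductive_turns >= 3:
        return "worried"          # global: the dialogue does not appear productive
    return "neutral"

print(choose_expression(0.9, {"SUB_GENRE": "COMEDY"}, "SPECIFY", 0))  # -> neutral
print(choose_expression(0.2, {}, "SPECIFY", 0))                       # -> perplexed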
Figure 3. Non-verbal Behaviour.
4.3 Global Reasoning and Dialogue History
Most of the control mechanisms described so far (for
user replies, dialogue repair, and non-verbal behaviour)
are essentially local in nature. Even if, in this specific
context, most actions can be determined as a function of
the latest speech acts, there is still a need to monitor
global dialogue progression. The dialogue history stores
the sequence of user speech acts as recognised by the
system. This makes it possible not only to monitor dialogue length, but also to look for patterns indicating non-productive dialogue or a user's difficulty in making up his mind. This can in turn trigger a non-verbal or a verbal response from the system.
This can be evidenced in the following dialogue, which, unlike the other examples in this paper, was obtained with keyboard input.
User: Do you have any sports programmes?
System: I have found 5 programmes for this choice.
Would you like to watch: “Keegan's Greatest Games”?
User: Can I have some cricket instead?
System: I can suggest this programme with cricket:
“Sportstalk”
User: Who is conducting?
System: Sportstalk features Donald
User: Anything else?
System: I have no other programmes for that selection
User: Is there anything funny then?
System: I can recommend the following comedy: “Starting Out”
User: I do not want a sitcom
System: What would you like to see, then?
User: A funny movie
System: I would suggest this programme with comedy:
“Analyze This”
User: What is its rating?
System: Its rating is family
User: Can I have another one?
System: There are 1 other programmes for this choice
What about the following programme: “Casablanca”?
User: I want a western instead
System: I would suggest the following western: “Unforgiven”
The dialogue history is a stack of speech acts recognised by the system in the internal system format, which comprises the type of speech act together with its parameters (e.g. Reject genre). The history for the above dialogue is:
((I-REJECT SUB_GENRE WESTERN SEARCH)
(REJECT INSTANCE TOP NOSEARCH)
(PAR_RATING)
(SPECIFY SUB_GENRE COMEDY SEARCH)
(REJECT GENRE NIL NOSEARCH)
(I-REJECT SUB_GENRE COMEDY SEARCH)
(REJECT INSTANCE TOP NOSEARCH)
(CAST)
(SPECIFY SUB_GENRE CRICKET SEARCH)
(INITIAL (SPORT CRICKET POSITIVE) SEARCH))
The dialogue history is used as a central data structure to
measure dialogue progression and evaluate the quality of
dialogue. Besides measuring the overall dialogue length
and the number of proposals rejected, it supports the
identification of various rejection patterns and, as a research tool within our specific framework, the exploration of user behaviour, formalised through speech acts.
For instance, while, in the above dialogue, the system
only triggered dialogue repair once, there is a succession
of instance rejections followed by sub-genre rejections,
which would indicate difficulties in user choice. At this
stage, our exploration of rejection patterns has remained
largely empirical and has not been related to specific
dialogue theories.
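For illustration, the kind of pattern matching performed over the history can be sketched as follows; the abridged history and the notion of a "rejection run" are our own simplifications rather than the system's actual criteria.

# Hypothetical sketch: detect runs of consecutive rejections in the dialogue
# history, which would suggest difficulties in user choice.
HISTORY = [
    ("I-REJECT", "SUB_GENRE", "WESTERN"),
    ("REJECT", "INSTANCE", "TOP"),
    ("SPECIFY", "SUB_GENRE", "COMEDY"),
    ("REJECT", "GENRE", None),
    ("I-REJECT", "SUB_GENRE", "COMEDY"),
    ("REJECT", "INSTANCE", "TOP"),
]

def rejection_runs(history, min_len=2):
    """Return the lengths of consecutive rejection sequences (explicit or implicit)."""
    runs, current = [], 0
    for act, *_ in history:
        if act in ("REJECT", "I-REJECT"):
            current += 1
        else:
            if current >= min_len:
                runs.append(current)
            current = 0
    if current >= min_len:
        runs.append(current)
    return runs

print(rejection_runs(HISTORY))  # -> [2, 3]: successive rejections suggest user indecision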
5 Conclusion
The rationale for the use of human-computer dialogue in
Interactive TV is that it breaks down the information
exchange between the user and the system into manageable units. This is especially relevant considering the
many categories and criteria that can be used for selection and also the fact that the user may not have a fixed
choice a priori. The progressive refinement of the user
selection (which is by no means a monotonic process) is
reflected in an explicit representation. This representation serves as an attentional structure. This attentional
structure is not an explicit list of entities but rather a "semantic filter" characterising these entities. We have
found a consistent way of approaching dialogue management in this system: however, this was facilitated by
specific properties of the task, as well as the hierarchical
organisation of the EPG. We thus cannot claim that our approach can be generalised without some caution.
Acknowledgments
This work is part of the “Virtual interactive Presenter”
project, funded by the DTI under the “LINK Broadcast”
Programme. Schedule data as well as images have been
provided by the BBC. The user interface and animated
character have been developed by Advance Multimedia
Communications plc.
References
[André et al., 1998]. Elisabeth André, Thomas Rist and
Juergen Muller. Guiding the User Through Dynamically
Generated Hypermedia Presentations with a Life-Like
Character. Proceedings of the International Conference
on Intelligent User Interfaces (IUI-98), San Francisco, USA, 1998.
[Austin, 1962]. John Austin. How to Do Things with
Words. Oxford, Oxford University Press, 1962.
[Beskow and McGlashan, 1997]. Jonas Beskow and
Scott McGlashan. Olga: A Conversational Agent with
Gestures. In: Proceedings of the IJCAI'97 workshop on
Animated Interface Agents - Making them Intelligent,
Nagoya, Japan, August 1997.
[Busemann et al., 1997]. Stephan Busemann, Thierry
Declerck, Abdel Kader Diagne, Luca Dini, Judith Klein
and Sven Schmeier. Natural Language Dialogue Service for Appointment Scheduling Agents. In: Proceedings of ANLP'97, Washington DC, USA, 1997.
[Cavazza, 2000a] Marc Cavazza. Human-Computer Conversation for Interactive Television. In: Proceedings of
the Third International Workshop on Human-Computer
Conversation, Bellagio, Italy, pp. 42-47, July 2000.
[Cavazza, 2000b] Marc Cavazza. From Speech Acts to
Search Acts: a Semantic Approach to Speech Act Recognition. Proceedings of GOTALOG 2000, Gothenburg,
Sweden, pp. 187-190, June 2000.
[Cohen and Perrault, 1979]. Philip R. Cohen and C.
Raymond Perrault. Elements of a plan-based theory of
speech acts. Cognitive Science, 3(3), pp. 177-212, 1979.
[Fischer and Batliner, 2000]. Kerstin Fischer and Anton
Batliner. What Makes Speakers Angry in Human-Computer Conversation. In: Proceedings of the Third
International Workshop on Human-Computer Conversation, Bellagio, Italy, pp. 62-67, July 2000.
[Hobbs, 1978]. Jerry Hobbs. Resolving Pronoun References. Lingua, 44, pp. 311-338, 1978.
[Maier, 1996] Elisabeth Maier. Context Construction as
Subtask of Dialogue Processing: the VERBMOBIL Case.
Proceedings of the Eleventh Twente Workshop on Language Technologies (TWLT-11), Dialogue Management
in Natural Language Systems, University of Twente, The
Netherlands, pp. 113-122, 1996.
[Nagao and Takeuchi, 1994] Katashi Nagao and Akikazu
Takeuchi. Speech Dialogue with Facial Displays: Multimodal Human-Computer Conversation. In: Proceedings
of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL'94), pp. 102-109, 1994.
[Searle, 1975] John Searle. Indirect Speech Acts. In: P. Cole and J.L. Morgan (Eds.), Syntax and Semantics, vol. 3: Speech Acts, pp. 59-82, New York, Academic Press, 1975.
[Traum and Allen, 1992]. David Traum and James Allen.
A Speech Acts Approach to Grounding in Conversation.
Proceedings of the International Conference on Spoken
Language Processing (ICSLP’92), pp. 137-140, 1992.
[Traum and Hinkelman, 1992]. David Traum and Elisabeth A. Hinkelman. Conversation Acts in Task-Oriented
Spoken Dialogue. Computational Intelligence, vol. 8, n.
3, 1992.
[Trower, 2000]. Trower, Tandy. Microsoft Agent.
http://www.microsoft.com/msagent/ Microsoft Corporation, March 2000.
[Walker, 1996]. Marilyn Walker. Inferring Acceptance
and Rejection in Dialogue by Default Rules of Inference.
Language and Speech, 39(2), 1996.
[Wiebe et al., 1998] Janyce M. Wiebe, Thomas P.
O’Hara, thorsten Ohrstrom-Sandgren and Kenneth J.
McKeever. An Empirical Approach to Temporal Reference Resolution. Journal of Artificial Intelligence Research 9, pp. 247-293, 1998.