Representation and Reasoning in a Multimodal Conversational Character

Marc Cavazza
School of Computing and Mathematics, University of Teesside
Middlesbrough, TS1 3BA, United Kingdom
m.o.cavazza@tees.ac.uk

Abstract

We describe the reasoning mechanisms used in a fully-implemented dialogue system. This dialogue system, based on a speech acts formalism, supports a multimodal conversational character for Interactive Television. The system maintains an explicit representation of programme descriptions, which also constitutes an attentional structure. From the contents of this representation, it is possible to control various aspects of the dialogue process, from speech act identification to the multimodal presentation of the interface.

1 Introduction

In this paper, we describe the practical reasoning adopted in a fully-implemented human-computer dialogue system. This system is a conversational character [Nagao and Takeuchi, 1994] [Beskow and McGlashan, 1997] [André et al., 1998] for interactive TV that assists the user in his choice of a TV programme through an Electronic Programme Guide (EPG). It is also a multimodal system, as human-computer dialogue is synchronised with the character's non-verbal behaviour (i.e., facial expressions) and the display of background images corresponding to the programme categories being discussed at a given point in dialogue (though only system output is multimodal, input being through speech only). This system is based on the co-operative construction of a programme description from the expression of user preferences. The programme description constitutes in fact a representation of the current dialogue focus. This is a consequence of the task model for this specific information search dialogue, which is one of incremental search and construction of a programme description. In the next sections, after giving a brief overview of the system, we show that much of the practical reasoning can be based on the programme description, serving as an attentional structure [Wiebe et al., 1998]. This attentional structure is not a list of explicit entities but rather a semantic structure characterising the current focus. We also describe the control of the multimodal interface and how it can be based on the focus representation and the dialogue history, which constitute the two main representations used by the dialogue system.

2 System Overview

The system is a mixed-initiative conversational interface organised around a human character with which the user communicates through speech recognition. The interface uses the Microsoft Agent™ system with a set of animated bitmaps acquired from a real human subject. The dialogue system is based on speech act theory [Austin, 1962] [Cohen and Perrault, 1979]. Each user utterance is interpreted in terms of the specific set of speech acts defined for the system (see below). Speech act identification is based on the semantic content of the user utterance. Once the speech act is identified, the programme description is updated accordingly and will serve for further comparisons in the subsequent rounds of dialogue. The system has been fully implemented with a vocabulary of 300+ words, including a few proper names (< 10%). Future versions will essentially extend the vocabulary by increasing the number of proper names for cast, programme names, etc. [Cavazza, 2000a]. Figure 1 illustrates the linguistic processing step behind the system, namely the construction of a semantic representation from which the EPG is searched.
This semantic representation serves as a basis for the incremental construction of the attentional structure. Figure 2 shows the user interface, which comprises the conversational character and background images illustrating the topics under discussion (according to the dialogue focus). Apart from the identification of speech acts, which is based on a specific set of rules, reasoning takes place in the system to decide on the following actions:
• system replies to the user, and whether the system should carry out a new programme guide search
• non-verbal behaviour of the character
• display of background images in connection with the current dialogue status
• dialogue repair
In the next sections, we describe these various reasoning procedures.

Figure 1. From Parsing to EPG Search (parse and semantic annotations of the query "Is there a movie with John Wayne", the resulting feature structure of preferences and categories, and the Electronic Programme Guide).

3 Reasoning from the Attentional Structure

3.1 Speech Act Identification

The user utterance is interpreted as a speech act [Traum and Hinkelman, 1992] [Busemann et al., 1997]. The rationale for using speech acts is that they can categorise the user's reaction to the current dialogue focus, from which it is possible to generate an appropriate system response, in terms of EPG search or user reply. The illocutionary value of the user speech act is actually identified by using the attentional structure. To this end, the semantic content of the current user utterance is compared with the semantic content of the attentional structure [Cavazza, 2000b]. Comparison of semantic features can identify the user's intentions, such as acceptance, rejection or specification. This form of identification is well suited to incremental dialogue where new criteria are progressively introduced. The use of content and the comparison of successive utterances for speech act identification follows previous work by Walker [1996] and Maier [1996]. Speech act identification determines the overall system response. This response comprises: i) searching the EPG, ii) updating the attentional structure, iii) replying to the user and iv) when applicable, selecting an agent's facial expression. There is a specific set of rules for updating the attentional structure associated with each type of speech act. These maintain dialogue consistency by ensuring, for instance, that if a genre is rejected, its sub-genres no longer appear in the attentional structure. The selection of user replies and the generation of multimodal output are discussed in sections 3.2 and 4 respectively. The main speech acts are: specification of a sub-category ("specify"), explicit rejection of a category ("reject"), implicit rejection of a category ("i-reject", for instance, when the user has previously selected a western and asks for a comedy: "can I have a comedy instead?") and the "another" speech act, which rejects the lowest-level category it mentions. Finally, a "standby" speech act is recognised when the user utterance does not provide new information (without being an implicit confirmation either). This serves to identify non-productive dialogue phases.
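As a rough illustration of this comparison mechanism (the paper gives no code, so the function, data structures and feature names below are hypothetical), the identification of the main speech acts from the current filter and the attentional structure might be sketched as follows.

```python
# Illustrative sketch only, not the system's actual implementation: the speech
# act is identified by comparing the semantic features of the new filter with
# those already grounded in the attentional structure.
from enum import Enum

class SpeechAct(Enum):
    INITIAL = "initial"
    SPECIFY = "specify"
    REJECT = "reject"
    I_REJECT = "i-reject"
    ANOTHER = "another"
    STANDBY = "standby"

def identify_speech_act(new_filter: dict, attentional: dict) -> SpeechAct:
    """Compare the filter built from the current utterance with the
    attentional structure (the unification of previous filters)."""
    features = {k: v for k, v in new_filter.items() if k not in ("negative", "another")}
    if not attentional:
        return SpeechAct.INITIAL
    if new_filter.get("negative"):                 # explicit rejection ("not a western")
        return SpeechAct.REJECT
    if new_filter.get("another"):                  # "can I have another one?"
        return SpeechAct.ANOTHER
    # A value contradicting one already grounded signals an implicit rejection,
    # e.g. asking for a comedy when a thriller is grounded.
    if any(attentional.get(f) not in (None, v) for f, v in features.items()):
        return SpeechAct.I_REJECT
    # A feature not yet grounded refines the programme description.
    if any(f not in attentional for f in features):
        return SpeechAct.SPECIFY
    return SpeechAct.STANDBY                       # no new information
```

With these hypothetical structures, a grounded {"SUB_GENRE": "THRILLER"} and a new {"SUB_GENRE": "COMEDY"} filter yield I_REJECT, mirroring the transcript below.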
For instance, the following dialogue illustrates the recognition of an "implicit rejection" (I-reject) speech act. The dialogue transcript format is based on the various stages of processing. The user utterance (User:) is processed by the speech recognition system to produce system input (Recognised:). The input is parsed into a semantic structure (Semantics:). A Filter is generated from this semantic structure to search the EPG (Filter:); the unification of filters also constitutes the attentional structure. Speech acts are identified by comparing the current filter with the attentional structure. As the new sub-genre requested by the user ("comedy") contradicts the one previously grounded in the attentional structure ("thriller"), the system identifies an implicit rejection [Searle, 1975].

User: DO YOU HAVE ANY THRILLERS
Recognised: DO YOU HAVE ANY THRILLER
Semantics: ((QUESTION) (EXIST) (PROGRAMME ((SUB_GENRE THRILLER) (INDET))))
Filter: ((SUB_GENRE THRILLER))
Speech Act: (INITIAL (SUB_GENRE THRILLER) SEARCH)
System: I found 5 programmes corresponding to that selection. What about: "12 Monkeys"?

User: CAN I HAVE A COMEDY INSTEAD
Recognised: CAN I HAVE A COMEDY THERE ON
Semantics: (((QUESTION) (SUBJECT ((AUDIENCE USER))) (CHOICE+ ((VIEW))) (PROGRAMME ((SUB_GENRE COMEDY) (INDET)))))
Filter: ((SUB_GENRE COMEDY))
Speech Act: (I-REJECT SUB_GENRE COMEDY SEARCH)

3.2 Replying and Dialogue Repair

The believability of dialogue largely depends on the relevance of system replies. These should correspond to the perceived intentions of the user as identified through the recognised speech acts. Here, global dialogue consistency is maintained mostly through local reasoning, which is based on knowledge about the hierarchical organisation of categories in the EPG. This knowledge is used i) to generate appropriate responses, ii) to determine if sufficient information is available for searching the EPG and iii) to trigger dialogue repair when appropriate. For instance, explicit rejections ("not a western", "I don't like sitcoms", "I don't like James Woods") are acknowledged by the system when proposing alternative choices: "I have found this programme which is not a western:". This also confirms that the negative choice is grounded in discourse [Traum and Allen, 1992]. This is reflected internally by the construction of a "negative filter" containing explicitly rejected criteria, which remains valid for the duration of the current dialogue (see the sketch after the transcript below). There exist various levels of attentional focus: some are explicit and represented in the system, while others are implicit only and, as such, not represented in the absence of previous grounding.

User: IS THERE ANYTHING FUNNY I CAN WATCH
Recognised: THE IS THERE ANYTHING FUNNY I CAN WATCH
Semantics: (((QUESTION) (EXIST) (PROGRAMME ((CHOICE+))) (DET)) ((GOOD) (SUB_GENRE COMEDY)) ((AUDIENCE USER)) ((QUESTION)) ((VIEW)))
Filter: ((SUB_GENRE COMEDY))
Speech Act: (INITIAL (SUB_GENRE COMEDY) SEARCH)
System: I found 4 programmes corresponding to that selection. What about: "Starting Out"?

User: NO I DON'T WANT A SITCOM
Recognised: AN OLD I DON'T WANT A SITCOM
Semantics: ((CHOICE+) (NEGATIVE) ((AUDIENCE USER)) (PROGRAMME ((SUB_GENRE COMEDY) (GENRE ENTERTAINMENT) (INDET))))
Filter: ((GENRE ENTERTAINMENT) (SUB_GENRE COMEDY) (NEGATIVE))
Speech Act: (REJECT GENRE NIL NO_SEARCH)
System: What would you like to see, then?
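In the transcript above, the explicit rejection adds a (NEGATIVE) marker to the filter. A minimal sketch of how such a negative filter could constrain subsequent EPG searches is given below; the function, field names and sample data are illustrative assumptions, not the system's actual implementation.

```python
# Illustrative only: explicitly rejected criteria are kept in a "negative
# filter" for the whole dialogue and excluded from every subsequent EPG search.

def search_epg(epg, positive, negative):
    """Return programmes matching all grounded criteria (positive filter)
    while avoiding all explicitly rejected ones (negative filter)."""
    return [
        programme for programme in epg
        if all(programme.get(f) == v for f, v in positive.items())
        and all(programme.get(f) != v for f, v in negative.items())
    ]

# Hypothetical EPG entries, loosely based on the programmes mentioned above.
epg = [
    {"TITLE": "Starting Out", "GENRE": "ENTERTAINMENT", "SUB_GENRE": "COMEDY"},
    {"TITLE": "Keegan's Greatest Games", "GENRE": "SPORT", "SUB_GENRE": "FOOTBALL"},
]

# After "NO I DON'T WANT A SITCOM", the entertainment genre is rejected,
# so only the sports programme survives the search.
print(search_epg(epg, positive={}, negative={"GENRE": "ENTERTAINMENT"}))
```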
For instance, when proposing a programme on the basis of a high-level category (a movie genre, for example), there is implicit information, such as the movie cast, that can be known to the user or appear on background images while it has not been grounded in discourse and hence is not part of the current attentional structure. The introduction of any new category during dialogue triggers an EPG search in order to come up with some proposal as early as possible. This is no longer the case when the user rejects high-level categories such as genre and sub-genre. In the above dialogue, the user rejection ("I don't want a sitcom") rejects both the genre and the sub-genre and prompts the agent to take the initiative in order to refocus the dialogue ("what would you like, then?"). Another traditional form of repair takes place when the dialogue does not appear to progress. For instance, if the content of the user reply does not bring new information, the system will return the question to the user ("is this programme alright, then?"). Such a repair is used when the dialogue loses cohesion: it takes advantage of the linearity of focus in that application [Hobbs, 1978]. The various dialogue repair procedures currently implemented are mostly situational: they derive from the need to control dialogue progression. They are triggered by significant backtracking in dialogue (e.g., rejection of top-level categories) or unproductive dialogue (such as several utterances not contributing to the programme description).

4 Controlling Multimodal Output

Our system is a multimodal presentation system displaying i) various "meaningful" facial expressions for the talking character, ii) background images corresponding to the topic(s) under discussion and iii) text echoing the agent's synthesised speech. The talking character has been developed using the Microsoft Agent™ package as a software architecture [Trower, 2000]. This software provides an integration of character animation and Text-To-Speech, including automatic, though simplified, lipsync features (i.e., with a limited number of mouth shapes). To create the character, a human actor has been filmed against a blue background and video data acquired by chroma keying (Figure 2). The subject was instructed to adopt various facial expressions (happy, unhappy, surprised, etc.) and to read aloud word sequences, so that the set of mouth shapes required for lipsync could be recorded. The data has been converted into bitmap sequences and incorporated into Microsoft Agent™'s animation routines. This has produced mouth shapes to fit the lipsync facility, facial expressions to be displayed by the talking head and idle animations to be played between dialogue turns. The information presented to the user always emphasises the topic under discussion, i.e. the criterion that is being refined by the user, while also reminding the user of the categories selected so far, as they appear in the attentional structure (through the still images appearing in the character's background). Both contribute to the relevance of the presentation, even in the case of partial understanding.

4.1 Background Images

The choice of background images depends on the level of refinement reached by the dialogue. In that sense, they reflect the current focus of discussion, prior to the interpretation of the latest speech act. There exist different rules for displaying background images depending on the level of refinement of the current search.
At the topmost level, when the user is discussing high-level categories such as programme genres, the system displays a random selection of sample images corresponding to different genres. After a category has been selected (e.g. "could I watch a movie tonight?"), the system can display a selection of the available sub-genres. As the system always offers a possible instance early in the dialogue, that specific instance would in this case be part of the selection. Once a specific sub-genre is discussed, the system can display several instance programmes, again including the one it might be suggesting to the user as a first choice. One important aspect of background images is that they may constitute suggestions through a channel complementary to speech. Spoken suggestions generally take place at the programme instance level, once a set of well-identified programmes matches the user criteria. Graphic display, on the other hand, constitutes implicit suggestions. One example consists in hinting at the available sub-genres while a top-level category such as programme genre is under discussion. For instance, in the sample dialogue below, the first exchange refers to the genre category "sports". The interface, when replying to the user, can display in its background a sample of the available sub-genres, e.g. football, cricket, Formula 1 racing (Figure 2). There is thus a difference in focus and contents between the explicit representation (the attentional structure) and the background images, as the latter can contain information not grounded in discourse. However, the complementarity between the two channels eventually enhances the expressivity of the interface.

Figure 2. The User Interface with Background Images.

User: WHAT KIND OF SPORTS PROGRAMME DO YOU HAVE
Recognised: WHAT KIND OF SPORTS PROGRAMME YOU HAVE
Semantics: (((QUESTION) (EXIST) (PROGRAMME ((PROGRAMME) (GENRE SPORT) (INDET)))) ((V_SPEAKER)) ((VIEW)))
Filter: ((GENRE SPORT))
Speech Act: (INITIAL (GENRE SPORT) SEARCH)
System: I have 5 programmes for that selection. Would you like to watch "Keegan's Greatest Games"?

User: CAN I HAVE SOME CRICKET INSTEAD
Recognised: CAN I HAVE SAME CRICKET IS THERE
Semantics: (((QUESTION) (SUBJECT ((AUDIENCE USER))) (CHOICE+ ((VIEW)))) ((SUB_GENRE CRICKET)) ((QUESTION) (EXIST)))
Filter: ((GENRE SPORT) (SUB_GENRE CRICKET))
Speech Act: (SPECIFY SUB_GENRE CRICKET SEARCH)

4.2 Non-Verbal Behaviour

Non-verbal behaviour constitutes an additional channel through which the user can receive information. In addition, the non-verbal channel does not disrupt the overall course of dialogue. In our system, non-verbal behaviour is based on a small set of facial expressions that express wonder, happiness or sadness (Figure 3). The relations between facial expressions and dialogue can be determined by local or global dialogue conditions. Local conditions are those based on a single user utterance. For instance, a low speech recognition confidence score, or an empty semantic structure, can directly trigger a perplexed facial expression for the agent. Similarly, when the user is backtracking from previous choices and the system cannot readily make a suggestion, the repair prompt ("what kind of programme do you want, then?") is best accompanied by an appropriate facial expression, to emphasise the situation. User acceptance of the suggested programme is welcomed with cheerful greetings.
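As a small illustration of how such local conditions could be mapped to expressions (the confidence threshold, expression names and speech act labels below are assumptions of ours, not values taken from the system):

```python
# Hypothetical mapping from local dialogue conditions to the agent's facial
# expression; thresholds and names are illustrative assumptions.

def choose_expression(asr_confidence, semantics, speech_act):
    if asr_confidence < 0.4 or not semantics:    # poor recognition or empty parse
        return "perplexed"
    if speech_act == "ACCEPT":                   # user accepts the suggested programme
        return "happy"
    if speech_act in ("REJECT", "I-REJECT"):     # backtracking: accompany the repair prompt
        return "sad"
    return "neutral"                             # otherwise, idle animation between turns
```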
Global conditions are related to the control of dialogue progression: facial expressions can be used to send non-verbal signals giving feedback on the overall dialogue progression, based on the dialogue history (see below). For instance, the character might look more worried as the dialogue does not appear productive. However, in some cases a specific verbal intervention may be required. The important aspect is to know when to insert it in the dialogue; as some of these expressions might irritate the user [Fischer and Batliner, 2000], non-verbal behaviour might be a privileged mode of providing feedback to the user.

Figure 3. Non-verbal Behaviour.

4.3 Global Reasoning and Dialogue History

Most of the control mechanisms described so far (for user replies, dialogue repair and non-verbal behaviour) are essentially local in nature. Even if, in this specific context, most actions can be determined as a function of the latest speech acts, there is still a need to monitor global dialogue progression. The dialogue history stores the sequence of user speech acts as recognised by the system. This makes it possible not only to monitor dialogue length, but also to look for patterns indicating non-productive dialogue or a user's difficulty in making up his mind. This can then in turn trigger a non-verbal or a verbal response from the system. This can be evidenced in the following dialogue¹.

User: Do you have any sports programmes?
System: I have found 5 programmes for this choice. Would you like to watch: "Keegan's Greatest Games"?
User: Can I have some cricket instead?
System: I can suggest this programme with cricket: "Sportstalk"
User: Who is conducting?
System: Sportstalk features Donald
User: Anything else?
System: I have no other programmes for that selection
User: Is there anything funny then?
System: I can recommend the following comedy: "Starting Out"
User: I do not want a sitcom
System: What would you like to see, then?
User: A funny movie
System: I would suggest this programme with comedy: "Analyze This"
User: What is its rating?
System: Its rating is family
User: Can I have another one?
System: There are 1 other programmes for this choice. What about the following programme: "Casablanca"?
User: I want a western instead
System: I would suggest the following western: "Unforgiven"

¹ Unlike the other examples in this paper, this dialogue has been obtained with keyboard input.

The dialogue history is a stack of speech acts recognised by the system in the internal system format, which comprises the type of speech act together with its parameters (e.g. Reject genre). The history for the above dialogue is:

((I-REJECT SUB_GENRE WESTERN SEARCH)
 (REJECT INSTANCE TOP NOSEARCH)
 (PAR_RATING)
 (SPECIFY SUB_GENRE COMEDY SEARCH)
 (REJECT GENRE NIL NOSEARCH)
 (I-REJECT SUB_GENRE COMEDY SEARCH)
 (REJECT INSTANCE TOP NOSEARCH)
 (CAST)
 (SPECIFY SUB_GENRE CRICKET SEARCH)
 (INITIAL (SPORT CRICKET POSITIVE) SEARCH))

The dialogue history is used as a central data structure to measure dialogue progression and evaluate the quality of dialogue. Besides measuring the overall dialogue length and the number of proposals rejected, it supports the identification of various rejection patterns and, as a research tool within our specific framework, the exploration of user behaviour, formalised through speech acts. For instance, while the system only triggered dialogue repair once in the above dialogue, there is a succession of instance rejections followed by sub-genre rejections, which would indicate difficulties in user choice.
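To illustrate how such a history could be scanned for rejection patterns (the representation, pattern and threshold below are illustrative choices of ours rather than the system's actual rules):

```python
# Hypothetical scan of the dialogue history (most recent speech act first)
# for signs of non-productive dialogue, such as runs of rejections.

HISTORY = [
    ("I-REJECT", "SUB_GENRE", "WESTERN"),
    ("REJECT", "INSTANCE", "TOP"),
    ("PAR_RATING",),
    ("SPECIFY", "SUB_GENRE", "COMEDY"),
    ("REJECT", "GENRE", None),
    ("I-REJECT", "SUB_GENRE", "COMEDY"),
    ("REJECT", "INSTANCE", "TOP"),
    ("CAST",),
    ("SPECIFY", "SUB_GENRE", "CRICKET"),
    ("INITIAL", "GENRE", "SPORT"),
]

def user_seems_undecided(history, window=4, threshold=0.5):
    """Flag difficulty of choice when recent turns are dominated by rejections."""
    recent = history[:window]
    rejections = sum(1 for act in recent if act[0] in ("REJECT", "I-REJECT"))
    return rejections / len(recent) >= threshold

if user_seems_undecided(HISTORY):
    # Could trigger a worried facial expression or a verbal repair prompt.
    print("dialogue appears non-productive")
```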
At this stage, our exploration of rejection patterns has remained largely empirical and has not been related to specific dialogue theories.

5 Conclusion

The rationale for the use of human-computer dialogue in Interactive TV is that it breaks down the information exchange between the user and the system into manageable units. This is especially relevant considering the many categories and criteria that can be used for selection, and also the fact that the user may not have a fixed choice a priori. The progressive refinement of the user selection (which is by no means a monotonic process) is reflected in an explicit representation. This representation serves as an attentional structure. This attentional structure is not an explicit list of entities but rather a "semantic filter" characterising these entities. We have found a consistent way of approaching dialogue management in this system; however, this was facilitated by specific properties of the task, as well as by the hierarchical organisation of the EPG. We thus cannot claim that our approach can be generalised without some caution.

Acknowledgments

This work is part of the "Virtual interactive Presenter" project, funded by the DTI under the "LINK Broadcast" Programme. Schedule data as well as images have been provided by the BBC. The user interface and animated character have been developed by Advance Multimedia Communications plc.

References

[André et al., 1998] Elisabeth André, Thomas Rist and Juergen Muller. Guiding the User Through Dynamically Generated Hypermedia Presentations with a Life-Like Character. In: Proceedings of the International Conference on Intelligent User Interfaces (IUI-98), San Francisco, USA, 1998.

[Austin, 1962] John Austin. How to Do Things with Words. Oxford, Oxford University Press, 1962.

[Beskow and McGlashan, 1997] Jonas Beskow and Scott McGlashan. Olga: A Conversational Agent with Gestures. In: Proceedings of the IJCAI'97 Workshop on Animated Interface Agents - Making them Intelligent, Nagoya, Japan, August 1997.

[Busemann et al., 1997] Stephan Busemann, Thierry Declerck, Abdel Kader Diagne, Luca Dini, Judith Klein and Sven Scheimer. Natural Language Dialogue Service for Appointment Scheduling Agents. In: Proceedings of ANLP'97, Washington DC, USA, 1997.

[Cavazza, 2000a] Marc Cavazza. Human-Computer Conversation for Interactive Television. In: Proceedings of the Third International Workshop on Human-Computer Conversation, Bellagio, Italy, pp. 42-47, July 2000.

[Cavazza, 2000b] Marc Cavazza. From Speech Acts to Search Acts: a Semantic Approach to Speech Act Recognition. In: Proceedings of GOTALOG 2000, Gothenburg, Sweden, pp. 187-190, June 2000.

[Cohen and Perrault, 1979] Philip R. Cohen and C. Raymond Perrault. Elements of a Plan-Based Theory of Speech Acts. Cognitive Science, 3(3), pp. 177-212, 1979.

[Fischer and Batliner, 2000] Kerstin Fischer and Anton Batliner. What Makes Speakers Angry in Human-Computer Conversation. In: Proceedings of the Third International Workshop on Human-Computer Conversation, Bellagio, Italy, pp. 62-67, July 2000.

[Hobbs, 1978] Jerry Hobbs. Resolving Pronoun References. Lingua, 44, pp. 311-338, 1978.

[Maier, 1996] Elisabeth Maier. Context Construction as Subtask of Dialogue Processing: the VERBMOBIL Case. In: Proceedings of the Eleventh Twente Workshop on Language Technologies (TWLT-11), Dialogue Management in Natural Language Systems, University of Twente, The Netherlands, pp. 113-122, 1996.

[Nagao and Takeuchi, 1994] Katashi Nagao and Akikazu Takeuchi.
Speech Dialogue with Facial Displays: Multimodal Human-Computer Conversation. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL'94), pp. 102-109, 1994.

[Searle, 1975] John Searle. Indirect Speech Acts. In: P. Cole and J.L. Morgan (Eds.), Syntax and Semantics, vol. 3: Speech Acts, pp. 59-82, New York, Academic Press, 1975.

[Traum and Allen, 1992] David Traum and James Allen. A Speech Acts Approach to Grounding in Conversation. In: Proceedings of the International Conference on Spoken Language Processing (ICSLP'92), pp. 137-140, 1992.

[Traum and Hinkelman, 1992] David Traum and Elisabeth A. Hinkelman. Conversation Acts in Task-Oriented Spoken Dialogue. Computational Intelligence, 8(3), 1992.

[Trower, 2000] Tandy Trower. Microsoft Agent. Microsoft Corporation, http://www.microsoft.com/msagent/, March 2000.

[Walker, 1996] Marilyn Walker. Inferring Acceptance and Rejection in Dialogue by Default Rules of Inference. Language and Speech, 39(2), 1996.

[Wiebe et al., 1998] Janyce M. Wiebe, Thomas P. O'Hara, Thorsten Ohrstrom-Sandgren and Kenneth J. McKeever. An Empirical Approach to Temporal Reference Resolution. Journal of Artificial Intelligence Research, 9, pp. 247-293, 1998.