Sonderforschungsbereich 314 Künstliche Intelligenz - Wissensbasierte Systeme
KI-Labor am Lehrstuhl für Informatik IV
Leitung: Prof. Dr. W. Wahlster

VITRA

Universität des Saarlandes
FB 14 Informatik IV
Postfach 151150
D-66041 Saarbrücken
Fed. Rep. of Germany
Tel. 0681 / 302-2363

Bericht Nr. 103

Multimedia Presentation of Interpreted Visual Data

Elisabeth André, Gerd Herzog, Thomas Rist

June 1994

ISSN 0944-7814


Multimedia Presentation of Interpreted Visual Data

Elisabeth André, Gerd Herzog, Thomas Rist

German Research Center for Artificial Intelligence (DFKI)
D-66123 Saarbrücken, Germany
{andre, rist}@dfki.uni-sb.de

SFB 314, Project VITRA, Universität des Saarlandes
D-66041 Saarbrücken, Germany
herzog@cs.uni-sb.de

June 1994

Abstract

While computer vision aims at the transformation of image data into meaningful information, research in intelligent multimedia generation addresses the effective communication of information using multiple media such as text, graphics and video. We argue that combining the two research areas leads to an interesting new kind of information system. Such integrated systems will be able to flexibly transform visual data into various presentation forms including, for example, TV-style reports and illustrated articles. The paper elaborates on this transformation and provides a modularization into maintainable subtasks. How these subtasks can be accomplished will be sketched by means of Vips, a prototype system that has emerged from our previous work on scene analysis and multimedia generation. Vips analyses short sections of camera-recorded image sequences of soccer games and generates multimedia presentations of the interpreted visual data.

To appear in: Proc. of AAAI-94, Workshop on "Integration of Natural Language and Vision Processing", Seattle, WA, 1994.

1 Introduction

Image understanding systems which perform a qualitative interpretation of a continuous flow of visual data allow observation of inaccessible areas and will release humans from time-consuming and often boring observation tasks, e.g., in traffic control. Moreover, a sophisticated system may not only collect and condense data but also interpret them in a particular context and provide information that goes far beyond the set of visual input data (cf. [Herzog et al. 89; Koller et al. 92; Neumann 89; Tsotsos 85; Wahlster et al. 83; Walter et al. 88]).

Intelligent multimedia generation systems which employ several media such as text, graphics, and animation for the presentation of information (cf. [Arens et al. 93; Feiner & McKeown 93; Maybury 93; Roth et al. 91; Stock 91; Wahlster et al. 93]) increasingly attract attention in many application areas since they (1) are able to flexibly tailor presentations to a user's individual needs and style preferences, (2) may use one medium in place of another, and (3) may combine media so that the strength of one medium will overcome the weakness of another.

[Figure 1: Examples of presentation styles. The figure arranges common presentation styles (TV-style reports, radio-style reports, headlines, illustrated newspaper reports) along three dimensions: the media used (authentic video, speech, diagrams, written text), the reporting mode (simultaneous vs. retrospective), and the degree of detail (complete descriptions, only outstanding or unexpected observations, summary).]

Combining techniques for image understanding and intelligent multimedia generation will open the door to an interesting new type of computer-based information system that provides highly flexible access to the visual world.
To see the benefits of such systems we may look at information presentation in mass media like newspaper and television. There, multiple media have been used for years when reporting events, e.g., in sports reporting. The spectrum of commonly used presentation forms covers printed, often illustrated text, verbally commented pictures, authentic video clips, commented video, etc. However, the effort needed for manually preparing such presentations impedes the broad production of presentations for individual users. In contrast, an advanced computer-based reporting system could provide a low-cost way to present the same information in various forms depending on generation parameters such as a user's actual interests and style preferences, time restrictions, etc. Fig. 1 gives an impression of the variety of presentation styles that result from combining only three basic criteria: the information requirements, the reporting mode (i.e., the delay between data perception and information presentation), and the media used in the presentation.

The work described in this paper aims at a multimedia reporting system. Following the paradigm of rapid prototyping, we rely on our previous work in both analysis and interpretation of image sequences (cf. [Herzog et al. 89; Herzog & Wazinski 94]) and generation of multimedia presentations (cf. [Wahlster et al. 93; André & Rist 90]). Short sections of video recordings of soccer games have been chosen as the domain of discourse since they offer interesting possibilities for the automatic interpretation of visual data in a restricted domain. Also, the broad variety of commonly used presentation forms in sports reporting provides fruitful inspiration when investigating methods for the automated generation of multimedia reports.

2 From Visual Data to Multimedia Presentations

Our efforts aim at a system that essentially transforms acquired visual data into meaningful information which in turn will be transformed into a structured multimedia presentation. Fig. 2 provides a classification of representation formats as they may be used to bridge between the different steps of the transformation.

[Figure 2: Levels of representation.
  Presentation Level: multimedia output (e.g. "Meier passes the ball to ..."); mm-discourse structure; presentation goals, e.g. (Elaborate-Subevent S U ball-transfer#23 T).
  Conceptual Level: intentions and interactions, e.g. (Goal player#5 (attack player#8)); event propositions, e.g. (Proceed [3:39:05] (Event ball-transfer#23)); relation tuples, e.g. (s-rel-in ball#1 penalty-area#2).
  Sensory Level: GSD, i.e. object trajectories (TRAJ OBJ#001, OBJ#002, ...); digitized image sequence.]

In the following, we describe a decomposition of the transformation process into maintainable subtasks:

Processing image sequences

The processes on the sensory level start from digitized video frames and serve for the automated construction of symbolic computer-internal descriptions of perceived scenes. The analysis of time-varying image sequences is of particular importance. In this case, the processing concentrates on the recognition and tracking of moving objects. In the narrow sense, the intended output of a vision system would be an explicit, meaningful description of visible objects. Throughout this paper, we will use the term geometrical scene description (GSD), introduced in [Neumann 89], for this kind of representation.
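To make the notion of a GSD more concrete, the following sketch shows one possible minimal representation in Python; the class and attribute names are our own illustrative choices, not the actual VITRA data structures: a static background model plus frame-indexed positions for each tracked, previously known object.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    Point = Tuple[float, float]          # position in field coordinates (units assumed)

    @dataclass
    class StaticObject:
        name: str                        # e.g. "penalty-area#2" or "midfield-line#1"
        outline: List[Point]             # polygon in the static background model

    @dataclass
    class Trajectory:
        object_id: str                   # e.g. "player#6" or "ball#1"
        positions: Dict[int, Point]      # frame number -> position (25 frames per second)

    @dataclass
    class GSD:
        """Minimal geometrical scene description: static background plus moving objects."""
        background: List[StaticObject]
        trajectories: Dict[str, Trajectory] = field(default_factory=dict)

        def position(self, object_id: str, frame: int) -> Point:
            """Where was the given object at the given frame?"""
            return self.trajectories[object_id].positions[frame]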
Interpretation of visual information

High-level scene analysis aims at recognizing conceptual units at a higher level of abstraction and thus extends the scope of image understanding. The GSD serves as an intermediate representation between low-level image analysis and scene analysis and forms the basis for further interpretation processes which lead to representations constituting the conceptual level. These representations include spatial relations for the explicit characterization of spatial arrangements of objects, representations of recognized object movements, and also higher-level concepts such as representations of behaviour and interaction patterns of the agents observed. Since one and the same scene may be interpreted differently by different observers, the interpretation process should be flexible enough to allow for situation-dependent interpretation.

Content selection and organization

Depending on the user's information needs, the system has to decide which propositions from the conceptual level should be communicated. Even if a detailed description is requested, it would be inappropriate to mention every single proposition provided by the scene analysis component. Following general rules of communication as formulated by Grice [Grice 75], the system has to ensure that all relevant information will be encoded. On the other hand, the user should not be unnecessarily informed about facts he already knows. Furthermore, the system has to organize the selected contents in a coherent manner. Of course, a flexible system that supports varying styles of reporting cannot rely on a single strategy for content selection and organization. For example, in situations where all scene data are available before the generation process begins, content organization may use diverse sorting techniques to enhance coherency. These techniques usually fail in live reporting, where visual data are to be described while they are recorded and interpreted. In that case, however, the emphasis is usually more on topicality than on coherency of the description.

Coordinated distribution of information on several media

An optimal exploitation of different media requires a presentation system to decide carefully when to use one medium in place of another and how to integrate different media in a consistent and coherent manner. This also includes determining an appropriate degree of complementarity and redundancy of information presented in different media. A presentation that contains no redundant information at all tends to be incoherent. If, however, too much information is paraphrased in different media, the user may concentrate on one medium after a short time and probably overlook information.

Medium-specific encoding of information

A multimedia presentation system must manage the presentation of text, graphics, video, and whatever media it employs. In the simplest case, presentation means automatic retrieval of already available output units, e.g., canned text or recorded video clips. More ambitious approaches address generation from scratch. Such approaches incorporate design expertise and provide mechanisms for the automatic selection, creation, and combination of medium-specific primitives (e.g., words, icons, video frames, etc.). To ensure the coherency of presentations, output fragments in different media have to be tailored to each other. Therefore, no matter how medium-specific output is produced, it is important that the system maintains an explicit representation of the encodings used.
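One simple way to keep such an explicit record, sketched below in Python with invented names, is a registry that maps each communicated proposition to the medium-specific fragments that encode it; later generation steps can then check what has already been expressed, and in which medium, before adding paraphrases or cross-references.

    from collections import defaultdict
    from typing import Dict, List, Tuple

    class EncodingRegistry:
        """Records which output fragment (medium, fragment id) encodes which proposition."""

        def __init__(self) -> None:
            self._by_proposition: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

        def register(self, proposition: str, medium: str, fragment_id: str) -> None:
            self._by_proposition[proposition].append((medium, fragment_id))

        def encodings(self, proposition: str) -> List[Tuple[str, str]]:
            return list(self._by_proposition.get(proposition, []))

        def already_encoded(self, proposition: str, medium: str) -> bool:
            """True if the proposition has already been expressed in the given medium."""
            return any(m == medium for m, _ in self.encodings(proposition))

    # usage sketch
    registry = EncodingRegistry()
    registry.register("(Ball-Transfer player#6 ball#1 player#7)", "text", "sentence-12")
    registry.register("(Ball-Transfer player#6 ball#1 player#7)", "graphics", "arrow-3")
    assert registry.already_encoded("(Ball-Transfer player#6 ball#1 player#7)", "text")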
Output coordination

The last step of the transformation concerns the arrangement of the presentation fragments provided by the generators into a multimedia output. A purely geometrical treatment of this layout task would, however, lead to unsatisfactory results. Rather, layout has to be considered as an important carrier of meaning. For example, two pictures that serve to contrast objects should be placed side by side. When using dynamic media, such as animation and speech, layout design also requires the temporal coordination of output units.

An identification of subtasks as described above gives an idea of the processes that a reporting system has to maintain. The architectural organization of these processes is, however, a crucial issue, especially when striving for a system that supports various presentation styles. For example, the automatic generation of live presentations calls (1) for an incremental strategy for the recognition of object movements and assumed intentions, and (2) for an adequate coordination of recognition and presentation processes. Also, there are various dependencies between choices in the presentation part. To cope with such dependencies, it seems unavoidable to interleave the processes for content determination, mode selection and content realization.

3 VIPS: A Visual Information Presentation System

The most straightforward approach to building a reporting system is to rely on existing modules for the interpretation of image sequences and the generation of multimedia presentations. In conceiving our prototype system, called Vips, we consequently follow this approach wherever the reuse of modules from our previous systems Vitra [Herzog & Wazinski 94] and Wip [André & Rist 94] is possible. In the following, we sketch the processing mechanisms of Vips' core modules.

3.1 Image Analysis

For technical reasons, we do not directly incorporate a low-level vision component for the processing of the camera data into Vips. Rather, this task is done with the systems Actions [Sung 88] and Xtrack [Koller et al. 92] that have been developed by our partners at the Fraunhofer Institute for Information and Data Processing (IITB) in Karlsruhe. Actions recognizes moving objects within real-world image sequences. It performs a segmentation and cueing of moving objects by computing and analyzing displacement vector fields. The more recent Xtrack system accomplishes a model-based recognition and classification of rigid objects.

Sequences of up to 1000 images, i.e., 40 seconds of playback time, recorded with a stationary TV-camera during a game in the German professional soccer league, have been evaluated by the Actions system (cf. [Herzog et al. 89]). In this domain, segmentation becomes quite difficult because the moving objects cannot be regarded as rigid and occlusions occur very frequently. The as yet partial trajectories delivered by Actions are currently used to synthesize interactively a realistic GSD, with object candidates assigned to previously known players and the ball. The approach described in [Rohr 94] for the geometric modeling of an articulated body has been adopted in Vips in order to represent the players in the soccer domain (cf. [Herzog 92b]). The stationary part of the GSD, an instantiated model of the static background, is fed into the system manually.
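In Vips this assignment of object candidates is currently carried out interactively; purely as an illustration of the kind of decision involved (not the method actually used), the following Python sketch assigns a new trajectory candidate to whichever known object was last seen closest to the candidate's starting position.

    from math import dist
    from typing import Dict, Optional, Tuple

    Point = Tuple[float, float]

    def assign_candidate(candidate_start: Point,
                         last_positions: Dict[str, Point],
                         max_distance: float = 2.0) -> Optional[str]:
        """Assign a trajectory candidate to the nearest known object (player or ball).

        Returns None if no known object is close enough, i.e. the candidate would
        have to be resolved interactively. The threshold is an arbitrary example.
        """
        best_id, best_d = None, float("inf")
        for object_id, position in last_positions.items():
            d = dist(candidate_start, position)
            if d < best_d:
                best_id, best_d = object_id, d
        return best_id if best_d <= max_distance else None

    # usage sketch: a candidate starting near player#6's last known position
    print(assign_candidate((31.8, 44.2), {"player#6": (32.0, 44.5), "ball#1": (40.1, 50.0)}))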
3.2 Scene Interpretation

Many previous attempts at high-level scene analysis (e.g. [Neumann 89; Wahlster et al. 83; Walter et al. 88]) are based on an a posteriori interpretation strategy, which requires a complete GSD covering the entire image sequence as soon as the analysis process starts. Hence, these systems generate only retrospective scene descriptions. Greater flexibility can be achieved if an incremental strategy is employed (cf. [Herzog et al. 89; Koller et al. 92; Tsotsos 85]), with a GSD constructed step by step and processed simultaneously as the scene progresses. Immediate system reactions, as needed for live presentations and within autonomous systems, are possible because information about the present scene is provided, too.

In Vips, high-level scene analysis includes:

Computation of spatial relations

In the GSD, spatial information is encoded only implicitly. In analogy to prepositions, their linguistic counterparts, spatial relations provide a qualitative description of spatial arrangements of objects. Each spatial relation characterizes a class of object configurations by specifying conditions, such as the relative position of objects or the distance between them. Instead of assigning simple truth values to spatial predications, a measure of degrees of applicability has been introduced that expresses the extent to which a spatial relation is applicable (cf. [André et al. 87]). On the one hand, more exact scene descriptions are possible since the degree of applicability can be expressed linguistically (e.g., 'directly behind' or 'more or less in front of'). On the other hand, the degree of applicability can be used to select the most appropriate reference object(s) and relation if an object configuration can be described by several spatial predications. Our system is capable of computing topological (e.g., in, near, etc.) as well as orientation-dependent relations (e.g., left-of, over, etc.). Since the frame of reference is explicitly taken into account, the system can cope with the intrinsic, extrinsic, and deictic use of directional prepositions (cf. [André et al. 87; Gapp 94]).
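The actual applicability measures are defined in [André et al. 87; Gapp 94]; the following Python fragment is only a rough illustration of the idea for the topological relation near, with arbitrary distance thresholds, grading the relation on a scale from 0 to 1 instead of assigning a plain truth value.

    from math import dist
    from typing import Tuple

    Point = Tuple[float, float]

    def applicability_near(obj: Point, ref: Point,
                           inner: float = 2.0, outer: float = 10.0) -> float:
        """Degree of applicability of near(obj, ref) in [0, 1].

        Fully applicable within `inner` metres of the reference object, not
        applicable beyond `outer`, and decreasing linearly in between; the
        thresholds are illustrative and would normally depend on object sizes.
        """
        d = dist(obj, ref)
        if d <= inner:
            return 1.0
        if d >= outer:
            return 0.0
        return (outer - d) / (outer - inner)

    # the relation/reference pair with the highest degree is the best candidate
    # for verbalization, e.g. "more or less near the penalty area"
    print(applicability_near((3.0, 1.0), (0.0, 0.0)))   # roughly 0.85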
Characterization and interpretation of object movements

When analyzing time-varying image sequences, spatio-temporal concepts can also be extracted from the GSD. These conceptual units, which we will call motion events, serve for the symbolic abstraction of the temporal aspects of the scene. With respect to the natural language description of image sequences, events are meant to represent the meaning of motion and action verbs. The recognition of movements is based on event models, i.e., declarative descriptions of classes of higher conceptual units capturing the spatio-temporal aspects of object motions. The event concepts are organized into an abstraction hierarchy, grounded on specialization (e.g., running is a moving) and temporal decomposition (cf. Fig. 3). This conceptual hierarchy can also be utilized to guide the selection of the relevant propositions when producing a presentation.

Besides the question of which events are to be extracted from the GSD, it is decisive how the recognition process is realized. With respect to the generation of simultaneous multimedia presentations, the following problem becomes obvious. If the presentation is to be focused on what is currently happening, it is very often necessary to describe object motions even while they occur. Thus, motion events have to be recognized stepwise as they progress, and event instances must be made available for further processing from the moment they are first noticed.

Since the distinction between events that have and those that have not occurred is insufficient, we have introduced the additional predicates start, proceed, and stop, which can be used to characterize the progression of an event (cf. [André et al. 88]). Labeled directed graphs with edges of a certain type, so-called course diagrams, are used to model the prototypical progression of an event. Fig. 4 shows a simplified course diagram for the concept BALL-TRANSFER. It describes a situation in which a player passes the ball to a teammate. The event starts if a BALL-POSSESSION event stops and the ball is free. The event proceeds as long as the ball is moving free and stops when the recipient has gained possession of the ball.

    Header:              (BALL-TRANSFER ?p1*player ?b*ball ?p2*player)
    Conditions:          (eql (TEAM ?p1) (TEAM ?p2))
    Subconcepts:         (BALL-POSSESSION ?p1 ?b)   [I1]
                         (MOVE-FREE ?b)             [I2]
                         (BALL-POSSESSION ?p2 ?b)   [I3]
    Temporal-Relations:  [I1] :meets [BALL-TRANSFER]
                         [I1] :meets [I2]
                         [I2] :equal [BALL-TRANSFER]
                         [I2] :meets [I3]

    Figure 3: Event model

The recognition of an occurrence can be thought of as traversing the course diagram, where the edge types are used for the definition of the basic event predicates. Course diagrams rely on a discrete model of time, which is induced by the underlying sequence of digitized TV-frames. They allow incremental event recognition, since exactly one edge per unit of time is traversed. Using constraint-based temporal reasoning, course diagrams are constructed automatically from interval-based concept definitions (cf. [Herzog 92a]).

    S0 --:START-->   S1    Condition: (AND (STOP (BALL-POSS ?p1 ?b) ?t) (START (MOVE-FREE ?b) ?t))
    S1 --:PROCEED--> S1    Condition: (PROCEED (MOVE-FREE ?b) ?t)
    S1 --:STOP-->    S2    Condition: (AND (STOP (MOVE-FREE ?b) ?t) (START (BALL-POSS ?p2 ?b) ?t))

    Figure 4: Course diagram
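A minimal sketch of such a traversal is given below in Python; it is a strong simplification of the course diagram in Fig. 4 (the teammate condition and the underlying event predicates are replaced by a per-frame "who possesses the ball" observation, and all names are illustrative), but it shows how START, PROCEED and STOP incidences of BALL-TRANSFER can be reported incrementally, one transition test per frame.

    from typing import Optional

    class BallTransferRecognizer:
        """Traverses the simplified course diagram S0 -> S1 -> S2 of Fig. 4,
        one step per frame, and reports event predicates as they occur."""

        def __init__(self) -> None:
            self.state = "S0"
            self.passer: Optional[str] = None      # last player in ball possession

        def step(self, frame: int, possessor: Optional[str]) -> Optional[str]:
            """`possessor` is the player in possession at this frame, or None if
            the ball is moving free. Returns an event predicate or None."""
            if self.state == "S0":
                if self.passer is not None and possessor is None:
                    self.state = "S1"               # BALL-POSSESSION stopped, ball is free
                    return f"(START (BALL-TRANSFER {self.passer} ball ?) {frame})"
                self.passer = possessor
            elif self.state == "S1":
                if possessor is None:               # ball still moving free
                    return f"(PROCEED (BALL-TRANSFER {self.passer} ball ?) {frame})"
                self.state = "S2"                   # a new BALL-POSSESSION starts
                return f"(STOP (BALL-TRANSFER {self.passer} ball {possessor}) {frame})"
            return None

    # usage sketch: player#6 kicks the ball, it is free for two frames, then player#7 receives it
    recognizer = BallTransferRecognizer()
    for frame, possessor in enumerate(["player#6", None, None, "player#7"]):
        print(recognizer.step(frame, possessor))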
Recognition of presumed goals and plans of the observed agents

For human observers, the interpretation of visual information also involves inferring the intentions, i.e., the plans and goals, of the observed agents (e.g., player A does not simply approach player B, but he tackles him). In the soccer domain the influence of the agents' assumed intentions on the results of the scene analysis is particularly obvious. Given the positions of players, their team membership, and the distribution of roles in standard situations, stereotypical intentions can be inferred for each situation. We use the system component described in [Retz-Schmidt 91], which is able to incrementally recognize intentions of and interactions between the agents as well as the causes of possible plan failures. Partially instantiated plan hypotheses taken from a plan library are successively instantiated according to the incrementally recognized events. Each element of the plan library contains information about the necessary preconditions of the (abstract) action it represents as well as information about its intended effect. A hierarchical organization is achieved through the decomposition and specialization relations. Observable events and spatial relations constitute the leaves of the plan hierarchy. Knowledge about the cooperative (e.g., double-pass) and antagonistic behaviour (e.g., offside-trap) of the players is represented in the interaction library. A successful plan triggers the activation of a corresponding interaction schema.

3.3 Presentation Planning

Following a speech-act theoretic perspective, the generation of multimedia documents is considered a goal-directed activity (cf. [André & Rist 90]). Starting from a communicative goal (e.g., describe the scene), a presentation planner builds up a refinement-style plan in the form of a directed acyclic graph (DAG). This plan reflects the propositional contents of the potential document parts, the intentional goals behind the parts as well as the rhetorical relationships between them (cf. [André & Rist 93]). While the top of the presentation plan is a more or less complex presentation goal, the lowest level is formed by specifications of elementary presentation tasks (e.g., formulating a request or depicting an object) that are directly forwarded to the medium-specific design components.

To represent presentation knowledge, we have defined strategies that refer to both text and picture production. While some strategies reflect general presentation knowledge, others are more domain-dependent and specify how to present a certain subject. To utilize the plan-based approach in Vips, we define new strategies for scene description. For example, the strategy shown in Fig. 5 may be used to verbally describe a sequence of events by informing the user about the main events (e.g., team-attack), illustrating them by a snapshot, and providing more details about the subevents (e.g., kick).

    Header:                    (Describe-Scene S U ?events T)
    Effect:                    (FOREACH ?one-ev
                                 WITH (AND (BEL S (Main-Ev ?one-ev))
                                           (BEL S (In ?one-ev ?events)))
                                 (BMB S U (In ?one-ev ?events)))
    Applicability-Conditions:  (BEL S (Temporally-Ordered-Sequence ?events))
    Main Acts:                 ((FOREACH ?one-ev
                                  WITH (AND (BEL S (Main-Ev ?one-ev))
                                            (BEL S (In ?one-ev ?events)))
                                  (Inform S U ?one-ev T)))
    Subsidiary Acts:           ((Illustrate S U ?ev G)
                                (Elaborate-Subevents S U ?sub-ev ?medium))

    Figure 5: Plan operator for describing a scene

To accomplish the last communicative act, the strategy shown in Fig. 6 may be applied. It informs the user about all salient subevents and provides more details about the agents involved. To determine the salience of an event, factors such as its frequency of occurrence, the complexity of its generic event model, the salience of the involved objects, and the area in which it takes place are taken into account (see also [André et al. 88]). All events are described in their temporal order. Further grouping principles for events are discussed in [Maybury 91].

The strategies defined in Fig. 5 and Fig. 6 can be used to generate a posteriori scene descriptions. They presuppose that the input data from which the relevant information has to be selected are given a priori. Since both strategies iterate over complete lists of temporally ordered events, the presentation process cannot start before the interpretation of the whole scene is completed. However, Vips is also able to generate live reports. The main characteristic of this kind of presentation is that input data are continuously delivered by the scene interpretation system and the presentation planner has to react immediately to incoming data. In such a situation, no global organization of the presentation is possible. Instead of collecting scene data and organizing them (e.g., according to their temporal order as in the first two strategies), the system has to decide locally which event should be reported next, considering the current situation.
    Header:                    (Elaborate-Subevent S U ?ev T)
    Effect:                    (FOREACH ?sub-ev
                                 WITH (AND (BEL S (Salient ?sub-ev))
                                           (BEL S (Sub-Ev ?sub-ev ?ev)))
                                 (BMB S U (Sub-Ev ?sub-ev ?ev)))
    Applicability-Conditions:  (AND (BEL S (Sub-Events ?ev ?sub-events))
                                    (BEL S (Temporally-Ordered-Sequence ?sub-events)))
    Main Acts:                 ((FOREACH ?sub-ev
                                  WITH (AND (BEL S (In ?sub-ev ?sub-events))
                                            (BEL S (Salient ?sub-ev)))
                                  (Inform S U ?sub-ev T)))
    Subsidiary Acts:           ((Elaborate-Agents S U ?sub-ev ?medium))

    Figure 6: Plan operator for describing subevents

Such behavior is reflected by the strategy shown in Fig. 7. In contrast to the strategy shown in Fig. 6, events are selected for their topicality. Topicality is determined by the salience of an event and the time that has passed since its occurrence. Consequently, the topicality of events decreases as the scene progresses. If an outstanding event (e.g., a goal kick) occurs which has to be verbalized as soon as possible, the presentation planner may even give up partially planned presentation parts in order to communicate the new event as soon as possible.

    Header:                    (Describe-Next S U ?ev T)
    Effect:                    (AND (BMB S U (Next ?preceding-ev ?ev))
                                    (BMB S U (Last-Reported ?ev)))
    Applicability-Conditions:  (AND (BEL S (Last-Reported ?preceding-ev))
                                    (BEL S (Topical ?ev *Time-Available*))
                                    (BEL S (Next ?preceding-ev ?ev)))
    Main Acts:                 ((Inform S U ?ev T))
    Subsidiary Acts:           ((Describe-Next S U ?next-ev T))

    Figure 7: Plan operator for simultaneous description

The realization of the main act in Fig. 7 depends on whether the user has visual access to the scene or not. For example, an utterance such as "pay attention to the player in the penalty area" does not make much sense if the user does not see the scene.

3.4 Generating textual presentation parts

As for the event recognition component, the text generator described in [Harbusch et al. 91] follows an incremental processing scheme. It can begin outputting words before the input is complete. Such generators are more flexible because they can also be used in situations where it is not possible to delay the output until the input is complete (cf. [Finkler & Schauder 92]). However, it is no longer guaranteed that new input can always be integrated into a previously uttered part of a sentence. In such a case, revisions are necessary.

The first component that is activated during natural language generation is the text design component. As soon as the presentation planner decides that a particular element should be presented as part of a text, the element is handed over as input to this component. The main task of the text design component is the organization of input elements into clauses. This comprises the determination of the order in which the given input elements can be realized in the text, and lexical choice. The results of the text designer are preverbal messages. These preverbal messages are forwarded in a piecemeal fashion to the text realization component, where grammatical encoding, linearization and inflection take place. The text realization component is based on the formalism of Lexicalized LD/LP Tree Adjoining Grammars. It associates lexical items with syntactic rules, permits flexible expansion operations and allows the description of local dominance to be separated from linear precedence rules. These characteristics made it a good candidate for incremental generation.
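The following toy sketch (Python, invented names; not the actual TAG-based realizer) illustrates this incremental behaviour: a clause for a BALL-TRANSFER event is uttered as soon as agent, action and object are known, and the recipient phrase is appended later when the missing case role is supplied.

    from typing import Dict, List

    class IncrementalClause:
        """Toy incremental realizer: utters what it can, completes the clause later."""

        def __init__(self, roles: List[str]) -> None:
            self.roles = roles                  # case roles in surface order
            self.fillers: Dict[str, str] = {}
            self.uttered = 0                    # number of roles already spoken

        def add(self, role: str, phrase: str) -> str:
            """Add a filler and return the part of the clause that can now be uttered."""
            self.fillers[role] = phrase
            output: List[str] = []
            while self.uttered < len(self.roles) and self.roles[self.uttered] in self.fillers:
                output.append(self.fillers[self.roles[self.uttered]])
                self.uttered += 1
            return " ".join(output)

    clause = IncrementalClause(["agent", "action", "object", "recipient"])
    print(clause.add("agent", "Bommer, the midfield player,"))    # uttered immediately
    print(clause.add("action", "passes"))
    print(clause.add("object", "the ball"))
    # ... later, once the event recognition component has identified the recipient:
    print(clause.add("recipient", "to Bosch, the outside left."))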
3.5 Generating visual presentation parts

In a system like Vips it is quite natural to base the generation of visual presentations on the camera-recorded visual data and on information obtained from the various levels of image interpretation. For example, when generating live reports, one may include the original camera data directly in the presentation. In this case, the graphics generator will only be requested to forward the camera data to a video window. To deal with more interesting tasks, the system must have appropriate generation techniques at its disposal. For the current version of Vips, we have developed techniques for:

Content-based search for subsequences

A recorded image sequence can be split into subsequences of arbitrary length, between one image and all images. Content-based search serves to find such subsequences according to semantic criteria. For example, one may be interested in the occurrence of a particular event, or in the trajectory of a certain agent or object. In contrast to video transcription and presentation systems, such as Ivaps [Csinger & Booth 94], Vips' graphics generator benefits from the connection to the image understanding component. Search specifications are formulated on the level of event propositions. By tracing back the event recognition process, the original image data are localized and the corresponding subsequences are returned.
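A minimal sketch of such a search is given below in Python (the names are illustrative): each recognized event proposition keeps the frame interval it was derived from, so a query formulated on the level of event propositions can be mapped back onto subsequences of the recorded images.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class EventProposition:
        concept: str                 # e.g. "BALL-TRANSFER"
        args: Tuple[str, ...]        # e.g. ("player#6", "ball#1", "player#7")
        frames: Tuple[int, int]      # first and last frame the recognition was based on

    def find_subsequences(events: List[EventProposition],
                          concept: str, participant: str) -> List[Tuple[int, int]]:
        """Frame intervals of all recognized events of `concept` involving `participant`."""
        return [e.frames for e in events
                if e.concept == concept and participant in e.args]

    # usage sketch
    events = [
        EventProposition("BALL-TRANSFER", ("player#6", "ball#1", "player#7"), (210, 265)),
        EventProposition("ATTACK", ("player#20", "player#7"), (250, 300)),
    ]
    print(find_subsequences(events, "BALL-TRANSFER", "player#6"))   # -> [(210, 265)]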
Display style modifications

When displaying an image or an image sequence, material and temporal aspects may be modified in order to accomplish certain communicative goals, or to meet situation-dependent constraints, e.g., resource limitations. Concerning the visual appearance of objects, Vips supports photorealistic display styles (cf. [Herzog 92b]). Such presentations can be realized as filtered displays of the original camera frames. Starting from the propositional GSD, Vips is also able to produce schematic pictures and animations, likewise in 2D or 3D. In the schematic mode, static background objects are approximated by line drawings/3D models, and moving objects are represented by predefined icons/3D bodies. The generation of 3D animations is an interesting feature, since it allows the "re-recording" of a scene from arbitrary viewpoints, e.g., from the viewpoints of the agents and objects involved. Concerning the temporal aspect of an image sequence, three display modes can be chosen: true time (25 frames per second), slow motion, and quick motion.

[Figure 8: Live presentation — a video frame at time [6:57:80] shown together with the synchronized textual commentary: "... Bommer, the midfield player, passes the ball to Bosch, the outside left. Bosch is attacked by Maller, the outside right."]

Data aggregation

Showing an original image sequence is often less effective than a presentation with less visual data. This becomes obvious when dynamic concepts have to be visualized by static graphics which are to be included in a print document. The mere listing of frames is inappropriate because there is a high risk that an observer wouldn't see the "trees for the forest". Reporting in mass media gives valuable inspiration for enhancing the effectiveness of visual presentations by means of aggregation techniques. In Vips, we aim at operationalizations of such techniques for the production of dynamic and static visual presentations. For example, recorded video sequences can be shortened by cutting out less interesting frames. To find frames which can be omitted without destroying the presentation, we take into account the recognized event structure of the sequence and use criteria such as the spatial coherency of objects in subsequent frames. In the case of static graphics, we also start from the event structure to find the most significant key frames of a sequence. For some purposes a single key frame will suffice, e.g., when the result or outcome of an event has to be shown. In other situations, one may apply techniques used in technical illustrations to aggregate the information of several images into a single one. For example, to visualize an object trajectory, we start from a key frame that shows the object either in its start or end position and then superimpose an arrow on the image to trace the object's locations in succeeding or preceding frames.

Visualization of inferred information

The interpretation of image sequences may lead to information which is not directly apparent in the raw image data. This does not mean, however, that inferred information cannot be presented visually. Marking objects or object groups by color or annotating them with text labels are simple techniques to include additional information in a graphical presentation. For other purposes, superimposition techniques are more suitable. For example, when analyzing a soccer game, it may be of interest whether a player had alternative moves in a crucial situation. Provided that the image interpretation system is able to recognize such alternatives, they can be visualized by superimposing hypothetical trajectories on the original scene data.

4 Generation Examples

To give an impression of how Vips works, we present two generation examples taken from the domain of soccer. In the first example, a TV-style live report is to be generated, i.e., a soccer scene has to be described while the scene progresses. We simulate the progress of the scene by passing the GSD data incrementally to the recognition component. We choose text and video as presentation media which, in this case, means displaying the original image sequence without further modifications. For this kind of presentation, video is considered as the guiding medium to which the textual comments have to be tailored. The example is illustrated by Fig. 8, which shows a part of the coordinated stream of visual and textual output.

To describe the underlying generation process, we start at the image frame marked by the timestamp [6:57:80], which is displayed shortly after a preceding utterance has been completed. To select the next event to be textually communicated, the presentation planner applies the strategy shown in Fig. 7. When testing the applicability conditions of this strategy, the variable ?preceding-ev is instantiated with the last proposition that has been verbalized. To instantiate the variable ?ev, the presentation planner searches for a topical event that comes after ?preceding-ev. In the example, ?ev is bound to (Ball-Transfer :Agent player#6 :Object ball :Recipient nil :Begin [6:57:50]). This event is selected since it is the only event in which the ball is involved as one of the most salient objects. After the refinement of (Inform S U ?ev T), the following acts have been posted as new subgoals: four referential acts for specifying the action and its associated case roles, and an elementary surface speech act, S-Inform, that is passed on to the text designer.
Note that the presentation planner forwards a certain piece of information to the generator concerned as soon as it has decided which component should encode it. In our example, (S-Inform S U ...) is sent to the text generator although a content specification for the recipient is still missing. The text designer creates input for the TAG-based realization component, which starts processing this input and generates: "Bommer, the midfield player, passes the ball ...". In the meantime, the event recognition component has identified the recipient of the ball (player#7). This new information allows the presentation planner to determine the following content specification: (the ?z (name ?z Bosch) (outside-left ?z)). Thus, the incomplete specification that has been sent to the text generator is supplemented accordingly. In this case, the text generator is able to complete the sentence just by adding the prepositional phrase "to Bosch, the outside left.". Of course, there are also situations in which revisions are necessary (cf. [Wahlster et al. 93]). Meanwhile, the presentation planner has again applied the strategy shown in Fig. 7, and ?ev is bound to (Attack :Agent player#20 :Patient player#7). After completing the last sentence, the realization component generates "He is attacked by Maller, the outside right."

In the second example, we assume that a retrospective description of a past scene is to be generated in a format that can be printed on paper. In this case, the system has to accomplish the goal (Describe-Scene S U ?events T), whereby the variable ?events is bound to a list of temporally ordered events delivered by the recognition component. The presentation planner first determines the main events and forwards a content specification to the text generator. In addition, it requests the graphics generator to illustrate the course of the event. Since only static graphics can be printed on paper, it is not possible to include the original video sequence in the presentation. Therefore, the graphics designer starts with a snapshot showing the positions of the players at the beginning of the events and relies on data aggregation to encode the trajectories of the moving objects (cf. Fig. 9).

[Figure 9: A posteriori report — a schematic snapshot with superimposed trajectories, accompanied by the text: "In the 15th minute, team A started an attack. Bösel (8), the outside left, centered the ball to Britz (9) in front of the goal. The goal keeper (1) intercepted the ball."]

During the generation of the illustration, the presentation planner has expanded (Elaborate-Subevents S U ?sub-ev ?medium) to determine which information about the subevents should be communicated to the user. In order to facilitate referent identification, the system has attached the numbers used in the icons to the expressions referring to the players.

5 Summary

In this contribution, we have reported on our efforts to bridge from computer vision to multimedia generation. We have outlined the system Vips, which takes camera-recorded image sequences as input, uses incremental strategies for the recognition of higher-level concepts such as spatial relations, motion events and intentions, and relies on a plan-based approach to communicate the recognized occurrences with multiple presentation media. Implementations of most of the core modules (scene interpretation, presentation planner, text generator) are already available and allow the automatic generation of textual descriptions for short image sequences.
The knowledge base of the system currently consists of about 100 concept definitions for spatial relations, motion events, plans, and plan interaction schemata. As yet, the graphics generation component only provides the basic functions (display of video sequences/single frames, icon-based visualization of trajectory data). To generate presentation examples as presented in Section 4, the interfacing between some components still has to be done manually. Our current efforts aim at a fully integrated version of the Vips system with improved graphics capabilities.

Perhaps the most interesting topic for further research is the bidirectional interleaving of image interpretation and presentation planning. In some situations, it would be useful for the presentation planner to request particular information from the interpretation system, which may eventually force the vision system to actively achieve conditions under which this information can be obtained, e.g., by changing the sensor parameters. Active vision is particularly required when information is missing which is needed to decide whether an applicability condition of a presentation strategy is satisfied or not. Furthermore, this feature could be used for the generation of visual presentation fragments: one simply drives the camera to obtain a certain picture or video clip.

Acknowledgements

The work described in this paper was partly supported by the Special Collaborative Program on AI and Knowledge-based Systems (SFB 314), project VITRA, of the German Science Foundation (DFG) and by the German Ministry for Research and Technology (BMFT) under grant ITW 8901 8, project WIP. We would like to thank Wolfgang Wahlster who, as the leader of both projects, made this cooperation possible.

References

[André & Rist 90] E. André and T. Rist. Towards a Plan-Based Synthesis of Illustrated Documents. In: Proc. of the 9th ECAI, pp. 25-30, Stockholm, 1990.

[André & Rist 93] E. André and T. Rist. The Design of Illustrated Documents as a Planning Task. In: M. T. Maybury (ed.), Intelligent Multimedia Interfaces, pp. 94-116. Menlo Park, CA: AAAI Press, 1993.

[André & Rist 94] E. André and T. Rist. Generating Coherent Presentations Employing Textual and Visual Material. Artificial Intelligence Review, 8(3), 1994.

[André et al. 87] E. André, G. Bosch, G. Herzog, and T. Rist. Coping with the Intrinsic and the Deictic Uses of Spatial Prepositions. In: K. Jorrand and L. Sgurev (eds.), Artificial Intelligence II: Methodology, Systems, Applications, pp. 375-382. Amsterdam: North-Holland, 1987.

[André et al. 88] E. André, G. Herzog, and T. Rist. On the Simultaneous Interpretation of Real World Image Sequences and their Natural Language Description: The System SOCCER. In: Proc. of the 8th ECAI, pp. 449-454, Munich, 1988.

[Arens et al. 93] Y. Arens, E. Hovy, and S. van Mulken. Structure and Rules in Automated Multimedia Presentation Planning. In: Proc. of the 13th IJCAI, pp. 1253-1259, Chambéry, France, 1993.

[Csinger & Booth 94] A. Csinger and K. S. Booth. Reasoning about Video: Knowledge-based Transcription and Presentation. In: J. F. Nunamaker and R. H. Sprague (eds.), HICSS-94, Volume III, Information Systems: Decision Support and Knowledge-based Systems, pp. 599-608, Maui, HI, 1994.

[Feiner & McKeown 93] S. K. Feiner and K. R. McKeown. Automating the Generation of Coordinated Multimedia Explanations. In: M. T. Maybury (ed.), Intelligent Multimedia Interfaces, pp. 117-138. Menlo Park, CA: AAAI Press, 1993.

[Finkler & Schauder 92] W. Finkler and A. Schauder.
Effects of Incremental Output on Incremental Natural Language Generation. In: Proc. of the 10th ECAI, pp. 505-507, Vienna, 1992.

[Gapp 94] K.-P. Gapp. Basic Meanings of Spatial Relations: Computation and Evaluation in 3D Space. In: Proc. of AAAI-94, pp. 1393-1398, Seattle, WA, 1994.

[Grice 75] H. P. Grice. Logic and Conversation. In: P. Cole and J. L. Morgan (eds.), Speech Acts, pp. 41-58. London: Academic Press, 1975.

[Harbusch et al. 91] K. Harbusch, W. Finkler, and A. Schauder. Incremental Syntax Generation with Tree Adjoining Grammars. In: W. Brauer and D. Hernandez (eds.), Verteilte Künstliche Intelligenz und kooperatives Arbeiten: 4. Int. GI-Kongreß Wissensbasierte Systeme, pp. 363-374. Berlin, Heidelberg: Springer, 1991.

[Herzog & Wazinski 94] G. Herzog and P. Wazinski. VIsual TRAnslator: Linking Perceptions and Natural Language Descriptions. Artificial Intelligence Review, 8(2/3):175-187, 1994.

[Herzog et al. 89] G. Herzog, C.-K. Sung, E. André, W. Enkelmann, H.-H. Nagel, T. Rist, W. Wahlster, and G. Zimmermann. Incremental Natural Language Description of Dynamic Imagery. In: C. Freksa and W. Brauer (eds.), Wissensbasierte Systeme. 3. Int. GI-Kongreß, pp. 153-162. Berlin, Heidelberg: Springer, 1989.

[Herzog 92a] G. Herzog. Utilizing Interval-Based Event Representations for Incremental High-Level Scene Analysis. In: M. Aurnague, A. Borillo, M. Borillo, and M. Bras (eds.), Proc. of the 4th International Workshop on Semantics of Time, Space, and Movement and Spatio-Temporal Reasoning, pp. 425-435, Château de Bonas, France, 1992.

[Herzog 92b] G. Herzog. Visualization Methods for the VITRA Workbench. Memo 53, Universität des Saarlandes, SFB 314 (VITRA), 1992.

[Koller et al. 92] D. Koller, N. Heinze, and H.-H. Nagel. Algorithmic Characterization of Vehicle Trajectories from Image Sequences by Motion Verbs. In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 90-95, Maui, HI, 1992.

[Maybury 91] M. T. Maybury. Planning Multisentential English Text Using Communicative Acts. PhD thesis, Rome Air Development Center, Air Force Systems Command, Griffiss Air Force Base, NY, 1991.

[Maybury 93] M. T. Maybury. Planning Multimedia Explanations Using Communicative Acts. In: M. T. Maybury (ed.), Intelligent Multimedia Interfaces, pp. 60-74. Menlo Park, CA: AAAI Press, 1993.

[Neumann 89] B. Neumann. Natural Language Description of Time-Varying Scenes. In: D. L. Waltz (ed.), Semantic Structures: Advances in Natural Language Processing, pp. 167-207. Hillsdale, NJ: Lawrence Erlbaum, 1989.

[Retz-Schmidt 91] G. Retz-Schmidt. Recognizing Intentions, Interactions, and Causes of Plan Failures. User Modeling and User-Adapted Interaction, 1:173-202, 1991.

[Rohr 94] K. Rohr. Towards Model-based Recognition of Human Movements in Image Sequences. Computer Vision, Graphics, and Image Processing (CVGIP): Image Understanding, 59(1):94-115, 1994.

[Roth et al. 91] S. F. Roth, J. Mattis, and X. Mesnard. Graphics and Natural Language as Components of Automatic Explanation. In: J. W. Sullivan and S. W. Tyler (eds.), Intelligent User Interfaces, pp. 207-239. New York, NY: ACM Press, 1991.

[Stock 91] O. Stock. Natural Language and Exploration of an Information Space: The ALFresco Interactive System. In: Proc. of the 12th IJCAI, pp. 972-978, Sydney, Australia, 1991.

[Sung 88] C.-K. Sung. Extraktion von typischen und komplexen Vorgängen aus einer langen Bildfolge einer Verkehrsszene. In: H. Bunke, O. Kübler, and P. Stucki (eds.), Mustererkennung 1988, pp. 90-96. Berlin, Heidelberg: Springer, 1988.

[Tsotsos 85] J. K. Tsotsos.
Knowledge Organization and its Role in Representation and Interpretation for Time-Varying Data: the ALVEN System. Computational Intelligence, 1:16-32, 1985.

[Wahlster et al. 83] W. Wahlster, H. Marburger, A. Jameson, and S. Busemann. Over-answering Yes-No Questions: Extended Responses in a NL Interface to a Vision System. In: Proc. of the 8th IJCAI, pp. 643-646, Karlsruhe, FRG, 1983.

[Wahlster et al. 93] W. Wahlster, E. André, W. Finkler, H.-J. Profitlich, and T. Rist. Plan-Based Integration of Natural Language and Graphics Generation. Artificial Intelligence, 63:387-427, 1993.

[Walter et al. 88] I. Walter, P. C. Lockemann, and H.-H. Nagel. Database Support for Knowledge-Based Image Evaluation. In: P. M. Stocker, W. Kent, and R. Hammersley (eds.), Proc. of the 13th Conf. on Very Large Databases, Brighton, UK, pp. 3-11. Los Altos, CA: Morgan Kaufmann, 1988.