Multimedia Presentation of Interpreted Visual Data

Sonderforschungsbereich 314
Künstliche Intelligenz - Wissensbasierte Systeme
KI-Labor am Lehrstuhl für Informatik IV
Head: Prof. Dr. W. Wahlster
VITRA
Universität des Saarlandes
FB 14 Informatik IV
Postfach 151150
D-66041 Saarbrücken
Fed. Rep. of Germany
Tel. 0681 / 302-2363
Report No. 103
Multimedia Presentation of Interpreted
Visual Data
Elisabeth André, Gerd Herzog, Thomas Rist
June 1994
ISSN 0944-7814
Multimedia Presentation of
Interpreted Visual Data
Elisabeth André, Gerd Herzog, Thomas Rist
German Research Center for Artificial Intelligence (DFKI)
D-66123 Saarbrücken, Germany
{andre, rist}@dfki.uni-sb.de
SFB 314, Project VITRA, Universität des Saarlandes
D-66041 Saarbrücken, Germany
herzog@cs.uni-sb.de
June 1994
Abstract
While computer vision aims at the transformation of image data
into meaningful information, research in intelligent multimedia generation addresses the effective communication of information using multiple media such as text, graphics and video. We argue that combining
the two research areas leads to an interesting new kind of information
system. Such integrated systems will be able to flexibly transform
visual data into various presentation forms including, for example,
TV-style reports and illustrated articles. The paper elaborates on
this transformation and provides a modularization into maintainable
subtasks. How these subtasks can be accomplished will be sketched by
means of Vips, a prototype system that has emerged from our previous work on scene analysis and multimedia generation. Vips analyses
short sections of camera-recorded image sequences of soccer games and
generates multimedia presentations of the interpreted visual data.
To appear in: Proc. of AAAI-94, Workshop on "Integration of Natural Language and
Vision Processing", Seattle, WA, 1994.
1 Introduction
Image understanding systems which perform a qualitative interpretation of a
continuous flow of visual data allow observation of inaccessible areas and will
release humans from time-consuming and often boring observation tasks, e.g.,
in traffic control. Moreover, a sophisticated system may not only collect and
condense data but also interpret them in a particular context and provide
information that goes far beyond the set of visual input data (cf. [Herzog
et al. 89; Koller et al. 92; Neumann 89; Tsotsos 85; Wahlster et al. 83;
Walter et al. 88]).
Intelligent multimedia generation systems which employ several media
such as text, graphics, and animation for the presentation of information (cf.
[Arens et al. 93; Feiner & McKeown 93; Maybury 93; Roth et al. 91; Stock 91;
Wahlster et al. 93]) increasingly attract attention in many application areas
since they (1) are able to flexibly tailor presentations to a user's individual
needs and style preferences, (2) may use one medium in place of another, and
(3) may combine media so that the strength of one medium will overcome
the weakness of another.
[Figure 1 shows examples of presentation styles (TV-style reports, radio-style reports, headlines, illustrated newspaper reports) characterized along three dimensions: reporting mode (e.g. simultaneous, retrospective), used media (e.g. authentic video, speech, written text, diagrams), and degree of detail (e.g. complete descriptions, only outstanding or unexpected observations, summary).]
Figure 1: Examples of presentation styles
Combining techniques for image understanding and intelligent multimedia generation will open the door to an interesting new type of computer-
based information system that provides highly flexible access to the visual
world.
To see the benefits of such systems we may look at information presentation in mass media like newspapers and television. There, multiple media
have been used for years when reporting events, e.g., in sports reporting. The
spectrum of commonly used presentation forms covers printed, often illustrated text, verbally commented pictures, authentic video clips, commented
video etc. However, the effort needed for manually preparing such presentations impedes broad production of presentations for individual users. In
contrast to that, an advanced computer-based reporting system could provide
a low-cost way to present the same information in various forms depending on
generation parameters such as a user's actual interests and style preferences,
time-restrictions etc.
Fig. 1 gives an impression of the variety of presentation styles that result
from combining only three basic criteria: the information requirements, the
reporting mode (i.e., the delay between data perception and information presentation), and the medium used in the presentation.
The work described in this paper aims at a multimedia reporting system.
Following the paradigm of rapid prototyping, we rely on our previous work
in both analysis and interpretation of image sequences (cf. [Herzog et al.
89; Herzog & Wazinski 94]) and generation of multimedia presentations (cf.
[Wahlster et al. 93; Andre & Rist 90]). Short sections of video recordings of
soccer games have been chosen as the domain of discourse since they offer
interesting possibilities for the automatic interpretation of visual data in a
restricted domain. Also, the broad variety of commonly used presentation
forms in sports reporting provides a fruitful inspiration when investigating
methods for automated generation of multimedia reports.
2 From Visual Data to Multimedia Presentations
Our efforts aim at a system that essentially transforms acquired visual data
into meaningful information which in turn will be transformed into a structured multimedia presentation. Fig. 2 provides a classification of representation formats as they may be used to bridge between the different steps
of the transformation. In the following, we describe a decomposition of the
transformation process into maintainable subtasks:
Processing image sequences
The processes on the sensory level start from digitized video frames and serve
for the automated construction of symbolic computer-internal descriptions of
[Figure 2 shows the levels of representation: at the sensory level, the digitized image sequence and trajectory data (GSD, TRAJ, OBJ#001, OBJ#002, ...); at the conceptual level, relation tuples such as (s-rel-in ball#1 penalty-area#2), event propositions such as (Proceed [3:39:05] (Event ball-transfer#23)), and intentions and interactions such as (Goal player#5 (attack player#8)); at the presentation level, presentation goals such as (Elaborate-Subevent S U ball-transfer#23 T), the multimedia discourse structure, and the multimedia output, e.g. "Meier passes the ball to ...".]
Figure 2: Levels of representation
perceived scenes. The analysis of time-varying image sequences is of particular importance. In this case, the processing concentrates on the recognition
and tracking of moving objects. In the narrow sense, the intended output of a
vision system would be an explicit, meaningful description of visible objects.
Throughout this paper, we will use the term geometrical scene description
(GSD), introduced in [Neumann 89], for this kind of representation.
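To make the notion of a GSD more concrete, the following sketch (in Python, with invented class and field names; not the actual VITRA data structures) illustrates one possible representation: frame-indexed positions for each moving object candidate plus a static background model.

    from dataclasses import dataclass, field
    from typing import Dict, Tuple

    Point = Tuple[float, float]  # field coordinates, e.g. in metres (assumed)

    @dataclass
    class ObjectTrack:
        object_id: str                                              # e.g. "OBJ#001", later mapped to "player#5"
        positions: Dict[int, Point] = field(default_factory=dict)   # frame number -> position

        def add_observation(self, frame: int, pos: Point) -> None:
            self.positions[frame] = pos

    @dataclass
    class GSD:
        """Geometrical scene description: static background plus moving-object trajectories."""
        background: Dict[str, Point]                                 # instantiated model of the static background
        tracks: Dict[str, ObjectTrack] = field(default_factory=dict)

        def position(self, object_id: str, frame: int) -> Point:
            return self.tracks[object_id].positions[frame]

    # Minimal usage example
    gsd = GSD(background={"goal-left": (0.0, 34.0)})
    gsd.tracks["player#5"] = ObjectTrack("player#5")
    gsd.tracks["player#5"].add_observation(100, (23.4, 41.2))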
Interpretation of visual information
High-level scene analysis aims at recognizing conceptual units at a higher
level of abstraction and thus extends the scope of image understanding. The
GSD serves as an intermediate representation between low-level image analysis and scene analysis and forms the basis for further interpretation processes
which lead to representations constituting the conceptual level. These representations include spatial relations for the explicit characterization of spatial
arrangements of objects, representations of recognized object movements,
and also higher-level concepts such as representations of behaviour and interaction patterns of the agents observed. Since one and the same scene may
be interpreted differently by different observers, the interpretation process
should be flexible enough to allow for situation-dependent interpretation.
Content selection and organization
Depending on the user's information needs the system has to decide which
propositions from the conceptual level should be communicated. Even if a
detailed description is requested, it would be inappropriate to mention every
single proposition provided by the scene analysis component. Following general rules of communication as formulated by Grice [Grice 75], the system
has to ensure that all relevant information will be encoded. On the other
hand, the user should not be unnecessarily informed about facts he already
knows. Furthermore, the system has to organize the selected contents in a
coherent manner. Of course, a flexible system that supports varying styles
of reporting cannot rely on single strategies for content selection and organization. For example, in situations where all scene data are available before
the generation process begins, content organization may use diverse sorting
techniques to enhance coherency. These techniques usually fail in live reporting where visual data are to be described while they are recorded and
interpreted. In that case, however, emphasis is usually more on topicality
than on coherency of the description.
Coordinated distribution of information on several media
An optimal exploitation of different media requires a presentation system to
decide carefully when to use one medium in place of another and how to integrate different media in a consistent and coherent manner. This also includes
determining an appropriate degree of complementarity and redundancy of
information presented in different media. A presentation that contains no
redundant information at all tends to be incoherent. If, however, too much
information is paraphrased in different media, the user may concentrate on
one medium after a short time and probably overlook information.
Medium-specific encoding of information
A multimedia presentation system must manage the presentation of text,
graphics, video, and whatever other media it employs. In the simplest case, presentation means automatic retrieval of already available output units, e.g.,
canned text or recorded video clips. More ambitious approaches address
generation from scratch. Such approaches incorporate design expertise and
provide mechanisms for the automatic selection, creation, and combination
of medium-specific primitives (e.g., words, icons, video frames, etc.). To ensure coherency of presentations, output fragments in different media have to
be tailored to each other. Therefore, no matter how the medium-specific output is produced, it is important that the system maintains an explicit
representation of the encodings used.
Output coordination
The last step of the transformation concerns the arrangement of the presentation fragments provided by the generators into a multimedia output. A purely
geometrical treatment of this layout task would, however, lead to unsatisfactory results. Rather, layout has to be considered as an important carrier
of meaning. For example, two pictures that serve to contrast objects should
be placed side by side. When using dynamic media, such as animation and
speech, layout design also requires the temporal coordination of output units.
An identification of subtasks as described above gives an idea of the processes that a reporting system has to maintain. The architectural organization of these processes is, however, a crucial issue, especially when striving
for a system that supports various presentation styles. For example, the automatic generation of live presentations calls (1) for an incremental strategy
for the recognition of object movements and assumed intentions, and (2) for
an adequate coordination of recognition and presentation processes. Also,
there are various dependencies between choices in the presentation part. To
cope with such dependencies, it seems unavoidable to interleave the processes
for content determination, mode selection and content realization.
3 VIPS: A Visual Information Presentation
System
The most straightforward approach to building a reporting system is to rely
on existing modules for the interpretation of image sequences, and the generation of multimedia presentations. When conceiving our prototype system,
called Vips, we consequently follow this approach wherever the reuse of modules
from our previous systems Vitra [Herzog & Wazinski 94] and Wip [Andre
& Rist 94] is possible. In the following, we sketch the processing mechanisms
of Vips' core modules.
3.1 Image Analysis
For technical reasons, we do not directly incorporate a low-level vision component for the processing of the camera data into Vips. Rather, this task
is done with the systems Actions [Sung 88] and Xtrack [Koller et al. 92]
that have been developed by our partners at the Fraunhofer Institute for
Information and Data Processing (IITB) in Karlsruhe. Actions recognizes
moving objects within real world image sequences. It performs a segmentation and cueing of moving objects by computing and analyzing displacement
vector fields. The more recent Xtrack system accomplishes a model-based
recognition and classification of rigid objects.
Sequences of up to 1000 images, i.e. 40 seconds play back time, recorded
with a stationary TV-camera during a game in the German professional soccer league, have been evaluated by the Actions system (cf. [Herzog et al.
89]). In this domain, segmentation becomes quite difficult because the moving objects cannot be regarded as rigid and occlusions occur very frequently.
The still partial trajectories delivered by Actions are currently used to
interactively synthesize a realistic GSD, with object candidates assigned to
previously known players and the ball. The approach described in [Rohr 94]
for the geometric modeling of an articulated body has been adopted in Vips
in order to represent the players in the soccer domain (cf. [Herzog 92b]). The
stationary part of the GSD, an instantiated model of the static background,
is fed into the system manually.
3.2 Scene Interpretation
Many previous attempts at high-level scene analysis (e.g. [Neumann 89;
Wahlster et al. 83; Walter et al. 88]) are based on an a posteriori interpretation strategy, which requires a complete GSD covering the entire image
sequence to be available before the analysis process starts. Hence, these systems can only generate retrospective scene descriptions.
Greater flexibility can be achieved if an incremental strategy is employed
(cf. [Herzog et al. 89; Koller et al. 92; Tsotsos 85]), with a GSD constructed step by step and processed simultaneously as the scene progresses.
Immediate system reactions, as needed for live presentations and within autonomous systems, become possible because information about the current scene
is provided as well. In Vips, high-level scene analysis includes:
Computation of spatial relations
In the GSD spatial information is encoded only implicitly. In analogy to
prepositions, their linguistic counterparts, spatial relations provide a qualitative description of spatial arrangements of objects. Each spatial relation
characterizes a class of object configurations by specifying conditions, such
as the relative position of objects or the distance between them.
Instead of assigning simple truth values to spatial predications, a measure
of degrees of applicability has been introduced that expresses the extent to
which a spatial relation is applicable (cf. [Andre et al. 87]). On the one hand,
more exact scene descriptions are possible since the degree of applicability
can be expressed linguistically (e.g. `directly behind' or `more or less in front
of'). On the other hand, the degree of applicability can be used to select the
most appropriate reference object(s) and relation if an object configuration
can be described by several spatial predications.
Our system is capable of computing topological (e.g. in, near, etc.) as
well as orientation-dependent relations (e.g. left-of, over, etc.). Since the
frame of reference is explicitly taken into account, the system can cope with
the intrinsic, extrinsic, and deictic use of directional prepositions (cf. [Andre
et al. 87; Gapp 94]).
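To make the idea of graded applicability concrete, the following sketch (our own illustration, not the actual computation of [Andre et al. 87; Gapp 94]) models the relation `near' as a value between 0 and 1 that decays with distance and uses it to pick the best reference object; the distance thresholds are invented for the example.

    import math
    from typing import Dict, Tuple

    Point = Tuple[float, float]

    def applicability_near(obj: Point, ref: Point, inner: float = 2.0, outer: float = 10.0) -> float:
        """Degree to which 'obj is near ref' applies: 1.0 within `inner` metres,
        0.0 beyond `outer` metres, linear in between (illustrative thresholds)."""
        d = math.dist(obj, ref)
        if d <= inner:
            return 1.0
        if d >= outer:
            return 0.0
        return (outer - d) / (outer - inner)

    def best_reference(obj: Point, candidates: Dict[str, Point]) -> Tuple[str, float]:
        """Select the reference object for which the spatial predication applies best."""
        return max(((name, applicability_near(obj, pos)) for name, pos in candidates.items()),
                   key=lambda pair: pair[1])

    print(best_reference((24.0, 40.0), {"penalty-area#2": (20.0, 40.0), "goal-left": (0.0, 34.0)}))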
Characterization and interpretation of object movements
When analyzing time-varying image sequences, spatio-temporal concepts can
also be extracted from the GSD. These conceptual units, which we will call
motion events, serve for the symbolic abstraction of the temporal aspects
of the scene. With respect to the natural language description of image
sequences, events are meant to represent the meaning of motion and action
verbs.
The recognition of movements is based on event models, i.e., declarative descriptions of classes of higher conceptual units capturing the spatiotemporal aspects of object motions. The event concepts are organized into an
abstraction hierarchy, grounded on specialization (e.g., running is a moving)
and temporal decomposition (cf. Fig. 3). This conceptual hierarchy can also
be utilized to guide the selection of the relevant propositions when producing a presentation. Besides the question of which events are to be extracted
from the GSD, it is also crucial how the recognition process is realized. With
respect to the generation of simultaneous multimedia presentations, the following problem becomes obvious. If the presentation is to be focused on
what is currently happening, it is very often necessary to describe object
motions even while they occur. Thus, motion events have to be recognized
stepwise as they progress and event instances must be made available for
further processing from the moment they are first noticed.
Since the distinction between events that have and those that have not
occurred is insufficient, we have introduced the additional predicates start,
proceed, and stop which can be used to characterize the progression of an
event (cf. [Andre et al. 88]).
Labeled directed graphs with edges of a certain type, so-called course
diagrams, are used to model the prototypical progression of an event. Fig.
4 shows a simplified course diagram for the concept BALL-TRANSFER. It
describes a situation in which a player passes the ball to a teammate. The
event starts if a BALL-POSSESSION event stops and the ball is free. The
Header:
(BALL-TRANSFER ?p1*player ?b*ball ?p2*player)
Conditions:
(eql (TEAM ?p1) (TEAM ?p2))
Subconcepts:
(BALL-POSSESSION ?p1 ?b) [I1]
(MOVE-FREE ?b) [I2]
(BALL-POSSESSION ?p2 ?b) [I3]
Temporal-Relations:
[I1] :meets [BALL-TRANSFER]
[I1] :meets [I2]
[I2] :equal [BALL-TRANSFER]
[I2] :meets [I3]
Figure 3: Event model
event proceeds as long as the ball is moving free and stops when the recipient
has gained possession of the ball.
The recognition of an occurrence can be thought of as traversing the
course diagram, where the edge types are used for the definition of the basic
event predicates. Course diagrams rely on a discrete model of time, which is
induced by the underlying sequence of digitized TV-frames. They allow incremental event recognition, since exactly one edge per unit of time is traversed.
Using constraint-based temporal reasoning, course diagrams are constructed
automatically from interval-based concept definitions (cf. [Herzog 92a]).
[Figure 4 shows a course diagram with states S0, S1, S2 and three edge types:
:START (S0 -> S1), Condition: (AND (STOP (BALL-POSS ?p1 ?b) ?t) (START (MOVE-FREE ?b) ?t))
:PROCEED (S1 -> S1), Condition: (PROCEED (MOVE-FREE ?b) ?t)
:STOP (S1 -> S2), Condition: (AND (STOP (MOVE-FREE ?b) ?t) (START (BALL-POSS ?p2 ?b) ?t))]
Figure 4: Course diagram
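A minimal sketch of how such a course diagram can be traversed incrementally (our own simplification of Fig. 3 and Fig. 4: the team-membership condition is omitted, and the predicates for ball possession and free movement are assumed to be supplied by the scene analysis): per frame exactly one step is taken, and the basic event predicates start, proceed, and stop are emitted.

    from typing import Callable, Optional

    class BallTransferRecognizer:
        """Incremental traversal of a BALL-TRANSFER course diagram (cf. Fig. 4).
        States: S0 (a player possesses the ball), S1 (ball moving free), S2 (recognized)."""

        def __init__(self, ball_possession: Callable[[int], Optional[str]],
                     move_free: Callable[[int], bool]):
            self.ball_possession = ball_possession   # frame -> id of the player in possession, or None
            self.move_free = move_free               # frame -> is the ball moving free?
            self.state = "S0"
            self.passer: Optional[str] = None

        def step(self, frame: int) -> Optional[str]:
            """Process one frame; return the event predicate emitted for this frame, if any."""
            owner = self.ball_possession(frame)
            if self.state == "S0":
                # :START edge: BALL-POSSESSION stops and the ball moves free
                if self.passer is not None and owner is None and self.move_free(frame):
                    self.state = "S1"
                    return f"(START (BALL-TRANSFER {self.passer} ball) [{frame}])"
                self.passer = owner
            elif self.state == "S1":
                # :PROCEED loop: the ball is still moving free
                if self.move_free(frame):
                    return f"(PROCEED (BALL-TRANSFER {self.passer} ball) [{frame}])"
                # :STOP edge: a second player gains possession of the ball
                if owner is not None:
                    self.state = "S2"
                    return f"(STOP (BALL-TRANSFER {self.passer} ball {owner}) [{frame}])"
            return None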
Recognition of presumed goals and plans of the observed agents
For human observers the interpretation of visual information also involves
inferring the intentions, i.e. the plans and goals, of the observed agents (e.g.,
player A does not simply approach player B, but he tackles him).
In the soccer domain, the influence of the agents' assumed intentions on
the results of the scene analysis is particularly obvious. Given the position
of players, their team membership and the distribution of roles in standard
situations, stereotypical intentions can be inferred for each situation. We
use the system component described in [Retz-Schmidt 91], which is able to
incrementally recognize intentions of and interactions between the agents as
well as the causes of possible plan failures.
Partially instantiated plan hypotheses taken from a plan library are successively instantiated according to the incrementally recognized events. Each
element of the plan library contains information about necessary preconditions of the (abstract) action it represents as well as information about its
intended effect. A hierarchical organization is achieved through the decomposition and specialization relations. Observable events and spatial relations
constitute the leaves of the plan hierarchy.
Knowledge about the cooperative (e.g., double-pass) and antagonistic behaviour (e.g., offside-trap) of the players is represented in the interaction
library. A successful plan triggers the activation of a corresponding interaction schema.
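As a highly simplified illustration of this matching process (not the component of [Retz-Schmidt 91]; the plan names and decompositions below are invented), plan hypotheses can be advanced step by step as recognized events come in:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class PlanHypothesis:
        name: str                       # e.g. "double-pass"
        expected_events: List[str]      # decomposition into observable event types, in order
        matched: int = 0                # how many steps have been observed so far

        def observe(self, event_type: str) -> None:
            """Advance the hypothesis if the incoming event matches the next expected step."""
            if self.matched < len(self.expected_events) and \
               self.expected_events[self.matched] == event_type:
                self.matched += 1

        @property
        def completed(self) -> bool:
            return self.matched == len(self.expected_events)

    library = [
        PlanHypothesis("double-pass", ["ball-transfer", "ball-transfer"]),
        PlanHypothesis("team-attack", ["ball-transfer", "run-towards-goal", "kick"]),
    ]

    for event in ["ball-transfer", "ball-transfer"]:
        for hyp in library:
            hyp.observe(event)

    print([h.name for h in library if h.completed])   # -> ['double-pass']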
3.3 Presentation Planning
Following a speech-act theoretic perspective, the generation of multimedia
documents is considered as a goal-directed activity (cf. [Andre & Rist 90]).
Starting from a communicative goal (e.g., describe the scene), a presentation
planner builds up a refinement-style plan in the form of a directed acyclic
graph (DAG). This plan reflects the propositional contents of the potential
document parts, the intentional goals behind the parts as well as rhetorical
relationships between them (cf. [Andre & Rist 93]). While the top of the
presentation plan is a more or less complex presentation goal, the lowest level
is formed by specications of elementary presentation tasks (e.g., formulating
a request or depicting an object) that are directly forwarded to the medium-specific design components.
To represent presentation knowledge, we have defined strategies that refer
to both text and picture production. While some strategies reect general
presentation knowledge, others are more domain-dependent and specify how
to present a certain subject. To utilize the plan-based approach in Vips, we
define new strategies for scene description. For example, the strategy shown
in Fig. 5 may be used to verbally describe a sequence of events by informing
the user about the main events (e.g., team-attack), illustrating them by a
snapshot, and providing more details about the subevents (e.g., kick).
Header: (Describe-Scene S U ?events T)
Effect: (FOREACH ?one-ev
WITH (AND (BEL S (Main-Ev ?one-ev))
(BEL S (In ?one-ev ?events)))
(BMB S U (In ?one-ev ?events)))
Applicability-Conditions:
(BEL S (Temporally-Ordered-Sequence ?events))
Main Acts:
((FOREACH ?one-ev
WITH (AND (BEL S (Main-Ev ?one-ev))
(BEL S (In ?one-ev ?events)))
(Inform S U ?one-ev T)))
Subsidiary Acts:
((Illustrate S U ?ev G)
(Elaborate-Subevents S U ?sub-ev ?medium))
Figure 5: Plan operator for describing a scene
To accomplish the last communicative act, the strategy shown in Fig. 6
may be applied. It informs the user about all salient subevents and provides
more details about the agents involved. To determine the salience of an event,
factors such as its frequency of occurrence, the complexity of its generic event
model, the salience of involved objects and the area in which it takes place
are taken into account (see also [Andre et al. 88]). All events are described
in their temporal order. Further grouping principles for events are discussed
in [Maybury 91].
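As an illustration only (the paper does not specify concrete weights), the factors mentioned above could be combined into a single numeric salience score roughly as follows:

    def salience(event_frequency: float, model_complexity: int,
                 object_salience: float, area_weight: float) -> float:
        """Illustrative salience score: rare, complex events involving salient objects
        in important areas (e.g. the penalty area) score highest. Weights are invented."""
        rarity = 1.0 / (1.0 + event_frequency)        # frequent events are less interesting
        return 0.4 * rarity + 0.2 * min(model_complexity, 10) / 10.0 \
             + 0.25 * object_salience + 0.15 * area_weight

    # A goal kick in the penalty area vs. an ordinary pass in midfield
    print(salience(event_frequency=0.1, model_complexity=6, object_salience=1.0, area_weight=1.0))
    print(salience(event_frequency=5.0, model_complexity=2, object_salience=0.6, area_weight=0.3))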
The strategies defined in Fig. 5 and Fig. 6 can be used to generate a
posteriori scene descriptions. They presuppose that the input data from
which relevant information has to be selected are a priori given. Since both
strategies iterate over complete lists of temporally ordered events, the presentation process cannot start before the interpretation of the whole scene is
completed.
However, Vips is also able to generate live reports. The main characteristic of this kind of presentation is that input data are continuously delivered
by a scene interpretation system and the presentation planner has to react
immediately to incoming data. In such a situation, no global organization of
the presentation is possible. Instead of collecting scene data and organizing
them (e.g., according to their temporal order as in the first two strategies),
the system has to locally decide which event should be reported next, considering the current situation. Such behavior is reflected by the strategy shown
Header: (Elaborate-Subevent S U ?ev T)
Effect: (FOREACH ?sub-ev
WITH (AND (BEL S (Salient ?sub-ev))
(BEL S (Sub-Ev ?sub-ev ?ev)))
(BMB S U (Sub-Ev ?sub-ev ?ev)))
Applicability-Conditions:
(AND (BEL S (Sub-Events ?ev ?sub-events))
(BEL S (Temporally-Ordered-Sequence ?sub-events)))
Main Acts:
((FOREACH ?sub-ev
WITH (AND (BEL S (In ?sub-ev ?sub-events))
(BEL S (Salient ?sub-ev)))
(Inform S U ?sub-ev T)))
Subsidiary Acts:
((Elaborate-Agents S U ?sub-ev ?medium))
Figure 6: Plan operator for describing subevents
in Fig. 7. In contrast to the strategy shown in Fig. 6, events are selected
according to their topicality. Topicality is determined by the salience of an event and
the time that has passed since its occurrence. Consequently, the topicality
of events decreases as the scene progresses. If an outstanding event (e.g.,
a goal kick) occurs which has to be verbalized immediately, the presentation planner may even give up partially planned presentation parts to
communicate the new event as soon as possible.
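A minimal sketch of the selection behind the Describe-Next strategy, assuming an exponential decay of topicality over time (the decay rate and threshold are our own assumptions, not values used in Vips):

    import math
    from typing import List, Optional, Tuple

    Event = Tuple[str, float, float]   # (description, salience, time of occurrence in seconds)

    def topicality(salience: float, occurred_at: float, now: float, half_life: float = 5.0) -> float:
        """Topicality = salience decayed by the time elapsed since the event occurred."""
        return salience * math.exp(-(now - occurred_at) * math.log(2) / half_life)

    def select_next(pending: List[Event], now: float, threshold: float = 0.2) -> Optional[Event]:
        """Pick the most topical not-yet-reported event, or None if nothing is topical enough."""
        if not pending:
            return None
        best = max(pending, key=lambda e: topicality(e[1], e[2], now))
        return best if topicality(best[1], best[2], now) >= threshold else None

    events = [("ball-transfer#23", 0.5, 417.5), ("goal-kick#7", 1.0, 419.0)]
    print(select_next(events, now=420.0))   # the outstanding goal kick wins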
Header: (Describe-Next S U ?ev T)
Effect: (AND (BMB S U (Next ?preceding-ev ?ev))
(BMB S U (Last-Reported ?ev)))
Applicability-Conditions:
(AND (BEL S (Last-Reported ?preceding-ev))
(BEL S (Topical ?ev *Time-Available*))
(BEL S (Next ?preceding-ev ?ev)))
Main Acts: ((Inform S U ?ev T))
Subsidiary Acts: ((Describe-Next S U ?next-ev T))
Figure 7: Plan operator for simultaneous description
The realization of the main act in Fig. 7 depends on whether the user has
visual access to the scene or not. For example, an utterance such as "pay
attention to the player in the penalty area" does not make much sense if the
user does not see the scene.
3.4 Generating textual presentation parts
Like the event recognition component, the text generator described in [Harbusch et al. 91] follows an incremental processing scheme. It can begin
outputting words before the input is complete. Such generators are more
flexible because they can also be used in situations where it is not possible
to delay the output until the input is complete (cf. [Finkler & Schauder 92]).
However, it is no longer guaranteed that new input can always be integrated
into a previously uttered part of a sentence. In such a case, revisions are
necessary.
The first component that is activated during natural language generation
is the text design component. As soon as the presentation planner decides
that a particular element should be presented as part of a text, the element
is handed over as input to this component. The main task of the text design
component is the organization of input elements into clauses. This comprises
the determination of the order in which the given input elements can be
realized in the text and lexical choice. The results of the text designer are
preverbal messages.
These preverbal messages are forwarded in a piecemeal fashion to the text
realization component where grammatical encoding, linearization and inflection take place. The text realization component is based on the formalism
of Lexicalized LD/LP Tree Adjoining Grammars. It associates lexical items
with syntactic rules, permits flexible expansion operations and allows the
description of local dominance to be separated from linear precedence rules.
These characteristics made it a good candidate for incremental generation.
3.5 Generating visual presentation parts
In a system like Vips it is quite natural to base the generation of visual presentations on the camera-recorded visual data and on information obtained
from various levels of image interpretation. For example, when generating
live reports, one may include original camera data directly in the presentation. In this case, the graphics generator will only be requested to forward
the camera data to a video window. To deal with more interesting tasks the
system must have appropriate generation techniques at its disposal. For the
current version of Vips, we have developed techniques for:
Content-based search for subsequences
A recorded image sequence can be split into subsequences of arbitrary length,
ranging from a single image to the entire sequence. Content-based search serves to find such
subsequences according to semantic criteria. For example, one may be interested in the occurrence of a particular event, or the trajectory of a certain
agent or object. In contrast to video transcription and presentation systems, such as Ivaps [Csinger & Booth 94], Vips' graphics generator benefits
from the connection to the image understanding component. Search specifications are formulated on the level of event propositions. Tracing back
the event recognition process, the original image data are localized and the
corresponding subsequences are returned.
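A sketch of this idea, under the assumption that the event recognition process records, for every recognized event instance, the frame interval it was derived from (the index and frame representation below are invented for illustration):

    from typing import Dict, List, Tuple

    FrameInterval = Tuple[int, int]   # (first frame, last frame)

    # Hypothetical index built during event recognition: event instance -> frame interval
    event_index: Dict[str, FrameInterval] = {
        "ball-transfer#23": (10450, 10520),
        "team-attack#4": (10300, 10650),
    }

    def frames_for_query(event_type: str, index: Dict[str, FrameInterval]) -> List[FrameInterval]:
        """Return the frame intervals of all recognized instances of a given event type."""
        return [interval for name, interval in index.items() if name.startswith(event_type)]

    def cut_subsequence(frames: List[bytes], interval: FrameInterval) -> List[bytes]:
        """Extract the corresponding frames from the recorded sequence."""
        first, last = interval
        return frames[first:last + 1]

    print(frames_for_query("ball-transfer", event_index))   # -> [(10450, 10520)]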
Display style modifications
When displaying an image or an image sequence, material and temporal aspects may be modified in order to accomplish certain communicative goals, or
to meet situation-dependent constraints, e.g., resource limitations. Concerning the visual appearance of objects, Vips supports photorealistic display
styles (cf. [Herzog 92b]). Such presentations can be realized as filtered displays of the original camera frames. Starting from the propositional GSD,
Vips is also able to produce schematic pictures and animations, likewise in
2D or 3D. In the schematic mode, static background objects are approximated by line drawings/3D models, and moving objects are represented by
predefined icons/3D bodies. The generation of 3D animations is an interesting feature, since it allows the "re-recording" of a scene from arbitrary
viewpoints, e.g., from the viewpoints of agents and objects involved. Concerning the temporal aspect of an image sequence, three display modes can
be chosen: true time (25 frames per second), slow motion, and quick motion.
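Reduced to its core, the temporal aspect amounts to resampling the 25 fps frame sequence; the following sketch uses a fixed factor of 2 for slow and quick motion, which is an assumption made only for the example:

    from typing import List, Sequence

    def resample(frames: Sequence, mode: str = "true-time") -> List:
        """Select frames for playback at 25 fps: 'true-time' keeps every frame,
        'slow-motion' repeats frames (factor 2), 'quick-motion' skips frames (factor 2)."""
        if mode == "true-time":
            return list(frames)
        if mode == "slow-motion":
            return [f for f in frames for _ in range(2)]
        if mode == "quick-motion":
            return list(frames[::2])
        raise ValueError(f"unknown display mode: {mode}")

    frames = list(range(100))   # stand-ins for decoded video frames
    print(len(resample(frames, "slow-motion")), len(resample(frames, "quick-motion")))   # 200 50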
[Figure 8 shows a live presentation: video frames arranged along the time axis, with the frame at [6:57:80] and the coordinated textual commentary "... Bommer, the midfield player passes the ball to Bosch, the outside left. Bosch is attacked by Maller, the outside right."]
Figure 8: Live presentation
Data aggregation
Showing an original image sequence is often less effective than a presentation with less visual data. This becomes obvious when dynamic concepts
have to be visualized by static graphics which are to be included in a print
document. The mere listing of frames is inappropriate because there is a
high risk that an observer would not see the forest for the trees. Reporting
in mass media provides valuable inspiration for enhancing the effectiveness of
visual presentations by means of aggregation techniques. In Vips, we aim
at operationalizations of such techniques for the production of dynamic and
static visual presentations. For example, recorded video sequences can be
shortened by cutting out less interesting frames. To find frames which can
be omitted without destroying the presentation, we take into account the
recognized event structure of the sequence and use criteria such as spatial
coherency of objects in subsequent frames. In the case of static graphics, we
also start from the event structure to find the most significant key frames of
a sequence. For some purposes a single key frame will suffice, e.g., when the
result or outcome of an event has to be shown. In other situations, one may
apply techniques used in technical illustrations to aggregate the information
of several images into a single one. For example, to visualize an object trajectory, we start from a key frame that shows the object either in its start
or end position and then superimpose an arrow on the image to trace the
object's locations in succeeding or preceding frames.
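The following sketch illustrates two of these aggregation heuristics under our own simplifying assumptions: selecting one key frame per recognized event (its last frame when the outcome matters) and shortening a sequence by keeping only frames covered by some recognized event.

    from typing import Dict, List, Tuple

    FrameInterval = Tuple[int, int]

    def key_frames(events: Dict[str, FrameInterval], prefer: str = "end") -> List[int]:
        """One key frame per event: its last frame (outcome) or its first frame (start)."""
        pick = (lambda iv: iv[1]) if prefer == "end" else (lambda iv: iv[0])
        return sorted(pick(iv) for iv in events.values())

    def shorten(events: Dict[str, FrameInterval], total_frames: int) -> List[int]:
        """Keep only frames covered by some recognized event; cut the rest."""
        keep = set()
        for first, last in events.values():
            keep.update(range(first, last + 1))
        return [f for f in range(total_frames) if f in keep]

    events = {"ball-transfer#23": (120, 180), "kick#9": (300, 320)}
    print(key_frames(events))          # -> [180, 320]
    print(len(shorten(events, 400)))   # -> 82 frames instead of 400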
Visualization of inferred information
The interpretation of image sequences may lead to information which is not
directly apparent in the raw image data. This does not, however, mean that
inferred information cannot be presented visually. Marking objects/object
groups by color or annotating them with text labels are simple techniques
to include additional information in a graphical presentation. For other purposes, superimposition techniques are more suitable. For example, when
analyzing a soccer game, it may be of interest to ask whether a player
had alternative moves in a crucial situation. Provided that the image interpretation system is able to recognize such alternatives, they can be visualized
by superimposing hypothetical trajectories on the original scene data.
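As an illustration of such superimposition (using the Pillow imaging library as an assumed drawing backend; this is not the actual VITRA visualization code), an object can be labeled and a hypothetical trajectory drawn on top of a frame:

    from typing import Sequence, Tuple
    from PIL import Image, ImageDraw

    Point = Tuple[float, float]

    def annotate_frame(frame: Image.Image, label: str, at: Point,
                       hypothetical_path: Sequence[Point]) -> Image.Image:
        """Mark an object with a text label and superimpose a hypothetical trajectory
        (drawn as a line through the given image coordinates)."""
        out = frame.copy()
        draw = ImageDraw.Draw(out)
        draw.text(at, label, fill="yellow")
        if len(hypothetical_path) >= 2:
            draw.line(list(hypothetical_path), fill="red", width=3)
        return out

    # Usage with a blank stand-in frame
    frame = Image.new("RGB", (640, 480), "green")
    annotated = annotate_frame(frame, "player#7", (320, 240), [(320, 240), (400, 180), (480, 160)])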
4 Generation Examples
To give an impression of how Vips works, we present two generation examples
taken from the domain of soccer.
In the first example, a TV-style live report is to be generated, i.e., a
soccer scene has to be described while the scene progresses. We simulate
the progress of the scene by passing the GSD data incrementally to the
recognition component. We choose text and video as presentation media,
which, in this case, means displaying the original image sequence without
further modifications. For this kind of presentation, video is considered as
the guiding medium to which textual comments have to be tailored.
The example is illustrated by Fig. 8, which shows a part of the coordinated
stream of visual and textual output. To describe the underlying generation
process, we start at the image frame marked by the timestamp [6:57:80] which
is displayed shortly after a preceding utterance has been completed.
To select the next event to be textually communicated, the presentation
planner applies the strategy shown in Fig. 7. When testing the applicability
conditions of this strategy, the variable ?preceding-ev is instantiated with the
last proposition that has been verbalized. To instantiate the variable ?ev, the
presentation planner searches for a topical event that comes after ?preceding-ev. In the example, ?ev is bound to (Ball-Transfer :Agent player#6 :Object
ball :Recipient nil :Begin [6:57:50]). This event is selected since it is the only
event in which the ball is involved as one of the most salient objects.
After the refinement of (Inform S U ?ev T), the following acts have been
posted as new subgoals: four referential acts for specifying the action and
its associated case roles and an elementary surface speech act, S-Inform,
that is passed on to the text designer. Note that the presentation planner
forwards a certain piece of information to the generator concerned as soon
as it has decided which component should encode it. In our example, (S-Inform S U ...) is sent to the text generator although a content specification
for the recipient is still missing. The text designer creates input for the TAG-based realization component, which starts processing this input and generates:
"Bommer, the midfield player, passes the ball ...".
In the meantime, the event recognition component has identified the recipient of the ball (player#7). This new information allows the presentation
planner to determine the following content specification: (the ?z (name ?z
Bosch) (outside-left ?z)). Thus, the incomplete specification that has been
sent to the text generator is supplemented accordingly. In this case, the text
generator is able to complete the sentence just by adding the prepositional
phrase "to Bosch, the outside left". Of course, there are also situations in
which revisions are necessary (cf. [Wahlster et al. 93]).
Meanwhile, the presentation planner has again applied the strategy shown
in Fig. 7, and ?ev is bound to (Attack :Agent player#20 :Patient player#7).
After completing the last sentence, the realization component generates "He
is attacked by Maller, the outside right."
In the second example, we assume that a retrospective description of a
past scene is to be generated in a format that can be printed on paper. In
this case, the system has to accomplish the goal (Describe-Scene S U ?events
T) whereby the variable ?events is bound to a list of temporally ordered
events delivered by the recognition component. The presentation planner
first determines the main events and forwards a content specification to the
text generator. In addition, it requests the graphics generator to illustrate
the course of the event. Since only static graphics can be printed on paper,
[Figure 9 shows an a posteriori report: a schematic snapshot with superimposed trajectories and the accompanying text "In the 15th minute, team A started an attack. Bösel (8), the outside left, centered the ball to Britz (9) in front of the goal. The goal keeper (1) intercepted the ball."]
Figure 9: A posteriori report
it is not possible to include the original video sequence in the presentation.
Therefore, the graphics designer starts with a snapshot showing the positions
of the players at the beginning of the events and relies on data aggregation
to encode the trajectories of the moving objects (cf. Fig. 9). During the generation of the illustration, the presentation planner has expanded (Elaborate-Subevents S U ?sub-ev ?medium) to determine which information about the
subevents should be communicated to the user. In order to facilitate referent
identication, the system has attached the numbers used in the icons to the
expressions referring to the players.
5 Summary
In this contribution, we have reported on our efforts to bridge the gap between computer vision and multimedia generation. We have outlined the system Vips
that takes camera-recorded image sequences as input and uses incremental
strategies for the recognition of higher-level concepts such as spatial relations, motion events and intentions, and relies on a plan-based approach to
communicate recognized occurrences with multiple presentation media.
Implementations of most of the core modules (scene interpretation, presentation planner, text generator) are already available and allow the automatic generation of textual descriptions for short image sequences. The
knowledge base of the system currently consists of about 100 concept definitions for spatial relations, motion events, plans, and plan interaction schemata.
As yet, the graphics generation component only provides the basic functions
(display of video sequences/single frames, icon-based visualization of trajectory data). To generate presentation examples like those presented in section 4,
interfacing between some components still has to be done manually. Our
current efforts aim at a fully integrated version of the Vips system with
improved graphics capabilities.
Perhaps the most interesting topic for further research is the bidirectional
interleaving of image interpretation and presentation planning. In some situations, it would be useful for the presentation planner to request particular
information from the interpretation system which may eventually force the
vision system to actively achieve conditions under which this information
can be obtained, e.g., by changing the sensor parameters. Active vision is
particularly required when information is missing which is needed to decide
whether an applicability condition of a presentation strategy is satisfied or
not. Furthermore, this feature could be used for the generation of visual
presentation fragments: one simply drives the camera to obtain a certain
picture or video clip.
Acknowledgements
The work described in this paper was partly supported by the Special Collaborative Program on AI and Knowledge-based Systems (SFB 314), project
VITRA, of the German Science Foundation (DFG) and by the German Ministry for Research and Technology (BMFT) under grant ITW8901 8, project
WIP. We would like to thank Wolfgang Wahlster who, as the leader of both
projects, made this cooperation possible.
References
[Andre & Rist 90] E. André and T. Rist. Towards a Plan-Based Synthesis of Illustrated Documents. In: Proc. of the 9th ECAI, pp. 25–30, Stockholm, 1990.

[Andre & Rist 93] E. André and T. Rist. The Design of Illustrated Documents as a Planning Task. In: M. T. Maybury (ed.), Intelligent Multimedia Interfaces, pp. 94–116. Menlo Park, CA: AAAI Press, 1993.

[Andre & Rist 94] E. André and T. Rist. Generating Coherent Presentations Employing Textual and Visual Material. Artificial Intelligence Review Journal, 8(3), 1994.

[Andre et al. 87] E. André, G. Bosch, G. Herzog, and T. Rist. Coping with the Intrinsic and the Deictic Uses of Spatial Prepositions. In: K. Jorrand and L. Sgurev (eds.), Artificial Intelligence II: Methodology, Systems, Applications, pp. 375–382. Amsterdam: North-Holland, 1987.

[Andre et al. 88] E. André, G. Herzog, and T. Rist. On the Simultaneous Interpretation of Real World Image Sequences and their Natural Language Description: The System SOCCER. In: Proc. of the 8th ECAI, pp. 449–454, Munich, 1988.

[Arens et al. 93] Y. Arens, E. Hovy, and S. van Mulken. Structure and Rules in Automated Multimedia Presentation Planning. In: Proc. of the 13th IJCAI, pp. 1253–1259, Chambéry, France, 1993.

[Csinger & Booth 94] A. Csinger and K. S. Booth. Reasoning about Video: Knowledge-based Transcription and Presentation. In: J. F. Nunamaker and R. H. Sprague (eds.), HICSS-94, Volume III, Information Systems: Decision Support and Knowledge-based Systems, pp. 599–608, Maui, HI, 1994.

[Feiner & McKeown 93] S. K. Feiner and K. R. McKeown. Automating the Generation of Coordinated Multimedia Explanations. In: M. T. Maybury (ed.), Intelligent Multimedia Interfaces, pp. 117–138. Menlo Park, CA: AAAI Press, 1993.

[Finkler & Schauder 92] W. Finkler and A. Schauder. Effects of Incremental Output on Incremental Natural Language Generation. In: Proc. of the 10th ECAI, pp. 505–507, Vienna, 1992.
[Gapp 94] K.-P. Gapp. Basic Meanings of Spatial Relations: Computation and Evaluation in 3D Space. In: Proc. of AAAI-94, pp. 1393–1398, Seattle, WA, 1994.

[Grice 75] H. P. Grice. Logic and Conversation. In: P. Cole and J. L. Morgan (eds.), Speech Acts, pp. 41–58. London: Academic Press, 1975.

[Harbusch et al. 91] K. Harbusch, W. Finkler, and A. Schauder. Incremental Syntax Generation with Tree Adjoining Grammars. In: W. Brauer and D. Hernández (eds.), Verteilte Künstliche Intelligenz und kooperatives Arbeiten: 4. Int. GI-Kongreß Wissensbasierte Systeme, pp. 363–374. Berlin, Heidelberg: Springer, 1991.

[Herzog & Wazinski 94] G. Herzog and P. Wazinski. VIsual TRAnslator: Linking Perceptions and Natural Language Descriptions. Artificial Intelligence Review, 8(2/3):175–187, 1994.

[Herzog et al. 89] G. Herzog, C.-K. Sung, E. André, W. Enkelmann, H.-H. Nagel, T. Rist, W. Wahlster, and G. Zimmermann. Incremental Natural Language Description of Dynamic Imagery. In: C. Freksa and W. Brauer (eds.), Wissensbasierte Systeme. 3. Int. GI-Kongreß, pp. 153–162. Berlin, Heidelberg: Springer, 1989.

[Herzog 92a] G. Herzog. Utilizing Interval-Based Event Representations for Incremental High-Level Scene Analysis. In: M. Aurnague, A. Borillo, M. Borillo, and M. Bras (eds.), Proc. of the 4th International Workshop on Semantics of Time, Space, and Movement and Spatio-Temporal Reasoning, pp. 425–435, Château de Bonas, France, 1992.

[Herzog 92b] G. Herzog. Visualization Methods for the VITRA Workbench. Memo 53, Universität des Saarlandes, SFB 314 (VITRA), 1992.

[Koller et al. 92] D. Koller, N. Heinze, and H.-H. Nagel. Algorithmic Characterization of Vehicle Trajectories from Image Sequences by Motion Verbs. In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 90–95, Maui, HI, 1992.

[Maybury 91] M. T. Maybury. Planning Multisentential English Text Using Communicative Acts. PhD thesis, Rome Air Development Center, Air Force Systems Command, Griffiss Air Force Base, NY, 1991.

[Maybury 93] M. T. Maybury. Planning Multimedia Explanations Using Communicative Acts. In: M. T. Maybury (ed.), Intelligent Multimedia Interfaces, pp. 60–74. Menlo Park, CA: AAAI Press, 1993.
[Neumann 89] B. Neumann. Natural Language Description of Time-Varying Scenes. In: D. L. Waltz (ed.), Semantic Structures: Advances in Natural Language Processing, pp. 167–207. Hillsdale, NJ: Lawrence Erlbaum, 1989.

[Retz-Schmidt 91] G. Retz-Schmidt. Recognizing Intentions, Interactions, and Causes of Plan Failures. User Modeling and User-Adapted Interaction, 1:173–202, 1991.

[Rohr 94] K. Rohr. Towards Model-based Recognition of Human Movements in Image Sequences. Computer Vision, Graphics, and Image Processing (CVGIP): Image Understanding, 59(1):94–115, 1994.

[Roth et al. 91] S. F. Roth, J. Mattis, and X. Mesnard. Graphics and Natural Language as Components of Automatic Explanation. In: J. W. Sullivan and S. W. Tyler (eds.), Intelligent User Interfaces, pp. 207–239. New York, NY: ACM Press, 1991.

[Stock 91] O. Stock. Natural Language and Exploration of an Information Space: The ALFresco Interactive System. In: Proc. of the 12th IJCAI, pp. 972–978, Sydney, Australia, 1991.

[Sung 88] C.-K. Sung. Extraktion von typischen und komplexen Vorgängen aus einer langen Bildfolge einer Verkehrsszene. In: H. Bunke, O. Kübler, und P. Stucki (Hrsg.), Mustererkennung 1988, pp. 90–96. Berlin, Heidelberg: Springer, 1988.

[Tsotsos 85] J. K. Tsotsos. Knowledge Organization and its Role in Representation and Interpretation for Time-Varying Data: the ALVEN System. Computational Intelligence, 1:16–32, 1985.

[Wahlster et al. 83] W. Wahlster, H. Marburger, A. Jameson, and S. Busemann. Over-answering Yes-No Questions: Extended Responses in a NL Interface to a Vision System. In: Proc. of the 8th IJCAI, pp. 643–646, Karlsruhe, FRG, 1983.

[Wahlster et al. 93] W. Wahlster, E. André, W. Finkler, H.-J. Profitlich, and T. Rist. Plan-Based Integration of Natural Language and Graphics Generation. Artificial Intelligence, 63:387–427, 1993.

[Walter et al. 88] I. Walter, P. C. Lockemann, and H.-H. Nagel. Database Support for Knowledge-Based Image Evaluation. In: P. M. Stocker, W. Kent, and R. Hammersley (eds.), Proc. of the 13th Conf. on Very Large Databases, Brighton, UK, pp. 3–11. Los Altos, CA: Morgan Kaufmann, 1988.