From: AAAI-96 Proceedings. Copyright © 1996, AAAI (www.aaai.org). All rights reserved.
Approximate World Models: Incorporating Qualitative and Linguistic Information into Vision Systems
Claudio S. Pinhanez and Aaron F. Bobick
Perceptual Computing Group - MIT Media Laboratory
20 Ames St. - Cambridge, MA 02139
pinhanez - bobick@media.mit.edu
Abstract
Approximate world models are coarse descriptions of the elements of a scene, and are intended to be used in the selection and control of vision routines in a vision system. In this paper we present a control architecture in which the approximate models represent the complex relationships among the objects in the world, allowing the vision routines to be situation- or context-specific. Moreover, because of their reduced accuracy requirements, approximate world models can employ qualitative information such as that provided by linguistic descriptions of the scene. The concept is demonstrated in the development of automatic cameras for a TV studio - SmartCams. Results are shown where SmartCams use vision processing of real imagery and information written in the script of a TV show to achieve TV-quality framing.
Introduction

It has been argued - e.g. in (Strat & Fischler 1991) - that in any given situation most visual tasks can be performed by a relatively simple visual routine. For example: finding the ground reduces to finding a large (body-relative) horizontal plane if the observer is vertical and there are no other large horizontal planes. The difficulty in general, of course, is how to know the current state of the world without having to do all the detailed visual tasks first. The numerous possibilities for the fundamental relationships between objects in the scene are as much responsible for the complexity of vision as is the difficulty of the visual routines themselves.

The goal of this paper is to argue that vision systems should separate these two sources of complexity by using coarse models of the objects in the scene called approximate world models. This proposal is based on the observation that the real world does not need to be fully and accurately understood to detect many situations where a specific vision method is likely to succeed or fail. For instance, full and precise 3D reconstruction of the human body is not necessary to detect occlusion if the objective of the system is just to recognize the faces of people walking through a gate.

A main feature of approximate world models is that their imprecision facilitates the use of incomplete and inaccurate sources of information such as linguistic descriptions of the elements and actions in a scene. In fact, we show in this paper that linguistic information can play a pivotal role in providing the contextual information needed to simplify the vision tasks.

We are employing approximate world models in the development of SmartCams, automatic TV cameras able to frame subjects and objects in a studio according to verbal requests from the TV director. Our SmartCams are tested in the domain of a cooking show. The script of the show is available to the SmartCams (in a particular format), and the cameras are shown to be able to produce TV-quality framing of subjects and objects.

Approximate World Models
Approximate world models are coarse descriptions of the main elements of a scene to be used in the selection and control of vision routines. These models are to be incorporated into vision-based systems built from a collection of different, simple, task-specific vision routines whose application is controlled according to the conditions described by the approximate world models.
This proposal comes from the observation that
a common reason for the failure of vision routines
- especially, view-based methods - is related to the
complex geometric relationships among objects in the
world. For example, often template-based tracking
routines produce wrong results due to partial occlusion. In such situations, a crude 3-D reconstruction
of the main objects in the scene can determine if the
tracked object is in a configuration where occlusion is
probable.
The advantages of using approximate models are at least three-fold. First, coarse reconstruction of the 3-D world is arguably within the grasp of current computer vision capabilities. Second, as shown in this paper, control of task-specific vision routines can be based on inaccurate and incomplete information. And third, as we will demonstrate, reducing the accuracy requirements enables the use of qualitative information which might be available to the vision system.
Coarse and/or hierarchical descriptions have been used before in computer vision (Bobick & Bolles 1992; Marr & Nishihara 1978). In particular, Bobick and Bolles employed a multi-level representational system where different queries were answered by different representations of the same object. Part of the novelty of our work lies in the use of the models in the dynamic selection of appropriate vision methods according to the world situation. And compared to other architectures for context-based vision systems, such as Strat and Fischler's Condor system (Strat & Fischler 1991), approximate world models provide a much clearer distinction between the vision component and the 3-D world component.
It is interesting to situate our scheme in the ongoing debate about reconstructionist vs. purposive vision
discussed in (Tarr & Black 1994) and in the replies in
the same issue. Our proposal falls between the strictly
reconstructionist and purely purposive strategies. We
are arguing that reconstruction should exist at the approximate level to guide purposive vision routines: by
building an approximate model of the scene vision systems can use task-specific, purposive vision routines
which work reliably in some but not all situations.
Using Approximate World Models in Vision Systems
To formalize the control exercised by the approximate world models in a vision system, we define applicability conditions for each vision routine: the set of assumptions that, if true in the current situation, warrants faith in the correctness of the results of that routine. The idea is to have each vision routine encapsulated in an applicability rule, which describes pre-conditions (the IF portion of the rule), application constraints (THEN portion), and post-conditions (TEST IF), in terms of general properties of the targeted object, other objects in the scene, the camera's point of view, and the result of the vision routine.
As an example, fig. 1 depicts the applicability rule of a vision routine, "extract-narrowest-moving-blob", and how the rule is applied in a given situation. The routine "extract-narrowest-moving-blob" is designed to detect moving regions in a sequence of two consecutive frames using simple frame differencing, and then to divide the result into two areas, of which the narrowest is returned.
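As a rough illustration (not the system's actual implementation), a minimal sketch of such a routine, assuming grayscale NumPy frames and an assumed median-row split between the two areas, could be:

import numpy as np

def extract_narrowest_moving_blob(prev_frame, curr_frame, threshold=20):
    # Simple frame differencing between two consecutive grayscale frames.
    diff = np.abs(curr_frame.astype(np.int32) - prev_frame.astype(np.int32))
    ys, xs = np.nonzero(diff > threshold)
    if xs.size == 0:
        return None  # no apparent motion between the two frames
    # Divide the moving region into two areas at its median row
    # (e.g. a head above a trunk); the split rule is an assumed detail.
    split_row = int(np.median(ys))
    boxes = []
    for area in (ys < split_row, ys >= split_row):
        if area.any():
            bx, by = xs[area], ys[area]
            boxes.append((bx.min(), by.min(), bx.max(), by.max()))
    # Return the narrower area, i.e. the smallest bounding-box width.
    return min(boxes, key=lambda b: b[2] - b[0])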
To use such a rule, the vision system consults the approximate model of TARGET (in the example case, the
head of a person) to obtain an estimation of its projection into the image plane of the camera, and also
to confirm if TARGET is moving. Moreover, the system
also looks for other moving objects close to TARGET,
constructing an approximate view model of the camera’s view.
Figure 1: Example of an applicability rule for a vision routine and how the information from the approximate world model is used.

If the conditions are satisfied by the approximate view model, the routine is applied to the imagery according to the instructions in the THEN portion of the applicability rule, which may also include information about routine parameters. The RESULT is then checked in the TEST IF portion of the rule, reducing the possibility that an incomplete specification of pre-conditions generates incorrect results. For instance, in the case of "extract-narrowest-moving-blob", often the lack of actual object movement makes the routine return tiny, incorrectly positioned regions, which are filtered out by the post-conditions.
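A minimal sketch of how an applicability rule of this kind might be encoded and evaluated follows; the field names, the view-model keys, and the post-condition threshold are hypothetical, and the routine reused is the sketch given above rather than the system's own code.

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ApplicabilityRule:
    preconditions: Callable[[dict], bool]        # IF: tests on the approximate view model
    routine: Callable[[Any, dict], Any]          # THEN: the vision routine and its parameters
    postconditions: Callable[[Any, dict], bool]  # TEST IF: sanity check on RESULT

def apply_rule(rule, view_model, imagery):
    # Run the routine only when the approximate view model licenses it.
    if not rule.preconditions(view_model):
        return None
    result = rule.routine(imagery, view_model)
    if not rule.postconditions(result, view_model):
        return None  # RESULT rejected by the post-conditions
    return result

# Illustrative encoding of the rule of fig. 1.
head_rule = ApplicabilityRule(
    preconditions=lambda vm: vm["target_in_view"] and vm["target_moving"]
                             and not vm["other_moving_objects_nearby"],
    routine=lambda imagery, vm: extract_narrowest_moving_blob(*imagery),
    postconditions=lambda blob, vm: blob is not None
                                    and (blob[2] - blob[0]) * (blob[3] - blob[1]) > 50,
)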
It is important to differentiate the concept of applicability rules from rule-based or expert-system approaches to computer vision (Draper et al. 1987;
Tsotsos 1985). Although we use the same keywords
(IF, THEN), the implied control structure has no resemblance to a traditional rule-based system: there is
no inference or chaining of results. More examples of
applicability rules can be found in (Bobick & Pinhanez
1995).
A Working Example: SmartCams
Our approach of using approximate world models is being developed in a system we are constructing for the
control of TV cameras. The ultimate objective is to
develop a camera for TV that can operate without the
cameraman, changing its attitude, zoom, and position
to provide specific images upon human request. We call these robot-like cameras SmartCams. A "cooking show" is the first domain in which we are experimenting with our SmartCams.
The basic architecture of a SmartCam is shown in fig. 2.

Figure 2: The architecture of a SmartCam. The bottom part of the figure shows the structure of the modules responsible for maintaining the approximate world models.

Figure 3: Example of the response to the call "close-up chef" by two different cameras, side and center. The left images show the projection of the approximate models on the wide-angle images. The right images display the result of vision routines as highlighted regions, compared to the predicted position of the head and the trunk according to the approximate model, shown as rectangles.

Considering the requirements of this application, the approximate world model represents the subjects and objects in the scene as 3-D blocks, cylinders, and ellipsoids, and uses symbolic frame-based representations (slots and keywords). The symbolic description of an object includes information about the class of objects to which the object belongs, its potential use, and its roles in the current actions.
The 3-D representations are positioned in a 3-D virtual space corresponding to the TV studio. The cameras' calibration parameters are also approximately known. The precision can be quite low, and in our system the position of an object might be off by an amount comparable to its size.
For example, if there is a bowl present in the studio, its approximate world model is composed of a 3-D geometric model of the bowl and a frame-like symbolic description. The 3-D geometric model approximates the shape of the bowl and is positioned in the virtual space according to the available information. The objects in the approximate world model belong to different categories. For example, a bowl is a member of the "handleable objects" category. As such, its frame includes slots which describe whether the bowl is being handled by a human and, if so, a slot which explicitly points to him/her.
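As a rough illustration of this style of representation (the slot names and values below are hypothetical, not the system's actual vocabulary), an object frame for the bowl might look like:

# A hypothetical frame for a bowl in the approximate world model.
bowl_model = {
    "category": "handleable-object",
    "geometry": {"shape": "cylinder", "radius_m": 0.12, "height_m": 0.08},
    "position_3d": (1.30, 0.20, 0.90),   # metres, in the studio's virtual space
    "potential_use": ["mixing", "holding-ingredients"],
    "handled_by": None,                   # points to a person frame when handled
    "role_in_current_action": None,
}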
The system which produced the results shown in this
paper does not use real moving cameras, but simulates
them using a moving window on wide-angle images of
the set. Several performances of a 5-minute scene as
viewed by three wide-angle cameras were recorded and
digitized. The SmartCam output image is generated
by extracting a rectangular window of some size from
the wide-angle images.
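A minimal sketch of this simulation step, assuming the wide-angle frames are NumPy arrays and that the desired framing is given directly in pixel coordinates, could be:

import numpy as np

def smartcam_output(wide_angle_frame, center_xy, out_w, out_h):
    # Simulate a moving camera by cropping a window from a wide-angle frame;
    # center_xy is the desired framing center in wide-angle pixel coordinates.
    h, w = wide_angle_frame.shape[:2]
    cx, cy = center_xy
    x0 = int(np.clip(cx - out_w // 2, 0, max(w - out_w, 0)))
    y0 = int(np.clip(cy - out_h // 2, 0, max(h - out_h, 0)))
    return wide_angle_frame[y0:y0 + out_h, x0:x0 + out_w]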
In fig. 3 we can see a typical result in the SmartCam domain where the inaccuracy of the approximate world models does not affect the final results obtained by the vision routines. Two SmartCams, side and center, were tasked to provide a close-up of the chef. Although the geometric model corresponding to the chef is quite misaligned, as can be seen from its projection into the wide-angle images of the scene (left side), both SmartCams, using routines similar to "extract-narrowest-moving-blob", produce correct results (the highlighted areas on the right of fig. 3).
Other examples of applicability rules and results can
be found in (Bobick & Pinhanez 1995).
Building and Maintaining Approximate World Models
Having shown how the information contained in approximate world models can be exploited by a vision
system performing tasks in a dynamic environment, a
fundamental issue remains: how does one construct an
initial model and then maintain such a model as time
progresses and the scene changes.
Contextual and semantic information is rarely employed in model construction because of its inability to
provide accurate geometric data. If the geometric accuracy requirements are relaxed, as they are in the case of
approximate world models, semantic information can
be used to predict possible positions and attitudes of
objects.
Furthermore, we view one of the roles of contextual
knowledge to be that of providing the basic relationships between objects in a given configuration of the
world. For example, while pounding meat the chef remains behind the table, positioned near the cutting
board, and the meat mallet is manipulated by the
hands. As long as such a context remains in force,
these relationships hold, and the job of maintaining the
approximate world model reduces to, predominantly,
a problem of tracking incremental changes within a
known situation.
Occasionally, though, there are major changes in the
structure of the world which require a substantial update in the structure of the approximate models. For
example, a new activity may have begun, altering most
expectations, including, for example, the possible location of objects, which objects are possibly moving, and
which objects are likely to be co-located making visual
separation unlikely (and, more interestingly, unnecessary). In our view, the intervals between major contextual changes
correspond
to the fundamental
actions
performed by the subjects in the scene; the contextual shifts themselves reflect the boundaries between
actions.
Thus, maintaining approximate world models requires two different methods of updating: one related to the tracking of incremental changes within a
fixed context, and the other responsible for performing the substantial changes in the context required at
the boundaries between different actions. Of course, it
is also necessary to have methods to obtain the initial
approximate model. The three required mechanisms
are briefly discussed below.
Initializing Approximate World Models
Whenever a vision system is designed using approximate models, it is necessary to face the issue of how
the models are initialized when the system is turned
on. Current computer vision methods can be employed
in the initialization process, if the system is allowed a
reasonable amount of time to employ powerful vision
algorithms without contextual information.
In our current version of the SmartCams the 3-D
models of the subjects and objects are determined and
positioned manually in the first frame of the scene. All
changes to the model after the first frame are accomplished automatically using vision and processing the
linguistic information as described later in this paper.
Tracking Incremental Changes
As proposed, while a particular action is taking place
it is presumed that the basic relationships among the
subjects and objects do not change. During such periods most of the updating of the approximate models can be accomplished by methods able to detect incremental changes, such as visual tracking algorithms.
Though simple, these small updates are vital to maintain the approximate model in a useful state, since approximate position information is normally an essential part of the applicability conditions of the task-specific vision routines.
In the SmartCam domain, the update of the incremental changes in the approximate world model, and
especially in its 3-D representations, is accomplished
by vision tracking routines able to detect movements of
the main components of the scene, as shown in the bottom part of the diagram in fig. 2. The two-dimensional
motions of an object detected by each of the wide-angle
cameras are integrated to determine the movement of
the object in the 3-D world. More details can be found
in (Bobick & Pinhanez 1995).
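The paper does not spell out the integration method here; one plausible sketch, assuming calibrated cameras with known 3x4 projection matrices, is a least-squares triangulation of the tracked image positions:

import numpy as np

def triangulate(proj_matrices, image_points):
    # Least-squares (DLT) triangulation of one 3-D point from its projections;
    # proj_matrices holds 3x4 camera matrices, image_points the (x, y) observations.
    rows = []
    for P, (x, y) in zip(proj_matrices, image_points):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]  # homogeneous -> Euclidean coordinates

def update_object_position(model, proj_matrices, tracked_points_2d):
    # Move an object's approximate 3-D position to agree with the 2-D
    # positions reported by the trackers on the wide-angle cameras.
    model["position_3d"] = tuple(triangulate(proj_matrices, tracked_points_2d))
    return model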
Note that the use of an approximate world model
may require additional sensing and computation which
might not be performed to directly address current perceptual tasks: e.g. the position of the body of the
chef is maintained even though the current task may
only involve framing the hands. We believe that this
additional cost is compensated by the increase in the
competence of the vision routines.
Actions and Structural Changes
Tracking algorithms are likely to fail whenever there is a drastic change in the structure of the scene, as, for example, when a subject leaves the scene. If this situation is recognized, the tracking mechanism for that subject should be turned off, avoiding false alarms.
It is not the objective of this paper to argue
about the different possible meanings of the term action. Here, actions refer to major segments of time
which people usually describe by single action verbs
as discussed by (Newtson, Engquist, & Bois 1977;
Thibadeau 1986; Kalita 1991).
A change in the action normally alters substantially
the relationships among subjects and objects. For example, we have a situation in the cooking show domain
where the chef first talks to the camera, and then he
picks up a bowl and starts mixing ingredients. While
the action “talking” is happening, there is no need for
the system to maintain explicit 3-D models for the
arms and the hands of the chef: the important elements involved are the position and direction of the
head and body.
When the chef starts mixing ingredients, it is essential that the approximate world model includes his
hands, the mixing bowl, and the ingredients. Fortunately, “mixing” also sets up clear expectations about
the positions of the hands in relation to the trunk of the chef, to the mixing bowl, to the ingredients' containers, and to the table on which they initially rest. As this
example illustrates, known contextual changes actually
improve the robustness of the system since with each
such event the system becomes “grounded” : there is
evidence independent of, and often more reliable than,
the visual tracking data about the state of the scene.
Linguistic Sources of Action Information
Without constraints from the domain, determining the
actions which might occur can be difficult. However, in
many situations the set of actions is severely restricted
by the environment or by the task, as, for example,
in the case of recognizing vehicle maneuvers in a gas station, as described in (Nagel 1994-5).
The SmartCam domain exemplifies another type of
situation, one in which there is available a linguistic
description of the sequence of actions to occur. In a
TV studio, the set of occurring actions - and, even,
the order of the actions - is determined by the script
of the show. We find the idea of having vision systems
capable of incorporating linguistic descriptions of actions very attractive: linguistic descriptions of actions are the most natural for human beings to generate. In particular, this form of information is well suited to semi-automated vision-based systems.
The use of linguistic descriptions requires their automatic translation into the system's internal representation of actions. This is mostly a Natural Language
Processing issue, although the feasibility is certainly
dependent on the final representation used by the vision system. In the SmartCam domain the final objective is to employ TV scripts in the format they are
normally written (see figure 4).
Representing Actions
Many formalisms have been developed to represent action, some targeting linguistic concerns (Schank 1975),
computer graphics synthesis (Kalita 1991), or computer vision recognition (Siskind 1994-5). Currently we employ a simple representation based on Schank's conceptualizations as described in (Schank 1975). In
spite of its weaknesses - see (Wilks 1975) - Schank’s
representation scheme is interesting for us because the
reduced number of primitive actions helps the design
of both the translation and the inference procedures.
Our representation for actions uses action frames, a
frame-based representation where each action is represented by a frame whose header is one of Schank’s
primitive actions PROPEL, MOVE, INGEST, GRASP,
EXPEL, PTRANS, ATRANS, SPEAK, ATTEND, MTRANS,
MBUILD - plus the attribute indexes HAVE and CHANGE,
and an undetermined action DO.
Cooking-show scenario with a table on which there are bowls, ingredients, and different kitchen utensils. A microwave oven is in the back. Cam1 is a centered camera, cam2 is a left-sided camera, cam3 is a camera mounted in the ceiling. Chef is behind the table, facing cam1.

...

Chef turns back to cam1 and mixes bread-crumbs, parsley, paprika, and basil in a bowl.

"Stir together 3 tablespoons of fine dry bread crumbs, 2 teaspoons snipped parsley, 1/4 teaspoon paprika, and 1/8 teaspoon dried basil, crushed. Set aside."

Chef wraps chicken with a plastic bag.

"Place one piece of chicken, boned side up, between two pieces of clear plastic wrap."

Chef puts the chicken on the chopping-board and shows how to pound the chicken.

"Working from the center to the edges, pound lightly with a meat mallet, forming a rectangle with this thickness. Be gentle, meat becomes as soft as you treat it."

Chef pounds the chicken with a meat-mallet.

...

Figure 4: The script of a TV cooking show.

Figure 5 contains two examples of action frames. The figure contains the representation for two actions of the script shown in fig. 4, "chef wraps chicken with a plastic bag" and "chef pounds the chicken with a meat-mallet". Each action
Each action
frame begins with a primitive action and contains different slots which supply the essential elements of the
action. Undetermined symbols begin with a question
mark (?); those symbols are defined only by the relationships they have with other objects. The actor
slot determines the performer of the action while the
object slot contains the object of an action or the
owner of an attribute. The action frames resulting
from an action are specified in the result slot, and
action frames which are part of the definition of another action frame are contained in instrument slots.
In the first example, “wrapping” is translated as
some action (unspecified by DO) whose result is to make
chicken be both contained in and in physical contact
with a plastic bag. In the second example, “pounding” is represented as an action where the chef propels
a meat mallet from a place which is not in contact with
the chicken to a place which is in contact, and whose
result is an increase in the flatness of the chicken.
The current version of our SmartCams translates a
simplified version of the TV script of fig. 4 into the
action frames of fig. 5 using a domain-specific, very
simple parser. Building a translator able to handle more generic scripts is clearly an NLP problem and, as such, is not a fundamental point in our research.
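For concreteness, a toy sketch of this kind of domain-specific translation is given below; the patterns and the frame layout are hypothetical and cover only the two example sentences, not the authors' parser.

import re

# Toy patterns for two script sentences; each maps to a (header, slots) frame.
PATTERNS = [
    (re.compile(r"chef wraps ([\w-]+) with a ([\w -]+)"),
     lambda m: ("do", {"actor": "chef",
                       "result": ("change", {"object": m.group(1),
                                             "to": [("contained", m.group(2)),
                                                    ("physical-contact", m.group(2))]})})),
    (re.compile(r"chef pounds the ([\w-]+) with a ([\w-]+)"),
     lambda m: ("propel", {"actor": "chef",
                           "object": m.group(2),
                           "result": ("change", {"object": m.group(1),
                                                 "to": ("flatness", "greater")})})),
]

def parse_script_line(line):
    # Return an action frame for a recognized script line, or None.
    for pattern, build in PATTERNS:
        match = pattern.search(line.lower())
        if match:
            return build(match)
    return None

print(parse_script_line("Chef pounds the chicken with a meat-mallet."))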
;; "chef wraps chicken with a plastic bag"
(do
  (actor chef)
  (result
    (change (object chicken)
            (to (and
                  (contained plastic-bag)
                  (physical-contact plastic-bag))))))

;; "chef pounds the chicken with a meat-mallet"
(propel
  (actor chef)
  (object meat-mallet)
  (from (location ?not-in-contact))
  (to (location ?in-contact))
  (result
    (change (object chicken)
            (from (flatness ?X))
            (to (flatness (greater ?X)))))
  (instrument
    (have (object ?not-in-contact)
          (attribute (negation (physical-contact chicken)))))
  (instrument
    (have (object ?in-contact)
          (attribute (physical-contact chicken))))
  (instrument
    (have (object chicken)
          (attribute (physical-contact chopping-board)))))

Figure 5: Action frames corresponding to two actions from the script shown in fig. 4.

Part of our current research is focused on designing a better representation for actions than the action frames
described in this paper. We are still debating the convenience of using Schank’s primitives to describe every
action. Also, action frames need to be augmented by
incorporating at least visual elements, as in (Kalita
1991), and time references, possibly using Allen's interval algebra (Allen 1984).
Using Action Frames Obtained From a Script
From the examples shown above, it is clear that linguistic descriptions of actions obtained from scripts do
not normally include detailed information about the
position, attitude, and movement of the persons and
objects in the scene. Linguistic accounts of actions normally describe only the essential changes in the scene,
but not the implications of those changes.
Therefore, to use information from scripts it is necessary to have an inference mechanism capable of extracting the needed details from the action frames. In
particular, to use the action frames generated from the
TV script in the SmartCam domain, it was necessary
to implement a simple inference system. It is important to make clear that our inference system is extremely simple and designed only to meet the demands
of our particular domain. The system was designed
to infer position and movement information about human beings’ hands, and physical contact and proximity
among objects.
The inference system is based on Rieger's inference system for Schank's conceptualizations (Rieger III 1975). The inferred action frames are sub-actions, or instrument actions, of the actions from which they are produced. To guarantee fast termination, the inference rules are applied in a pre-determined sequence, in a 1-pass algorithm.
As a typical case, the system uses as its input the action frame corresponding to the sentence "chef wraps chicken with a plastic bag" (as shown in
fig. 5) and deduces that the chef’s hands are close to
the chicken. The appendix depicts a more complex
example where from the action of pounding the system obtains the fact that the hands are close to the
chopping board. From the PROPEL action frame shown
in fig. 5, the inference system deduces some contact
relations between some objects, which imply physical
proximity.
The SmartCam’s inference system is certainly very
simple and works only for some scripts. However, the
ability of approximate models to handle inaccurate information helps the system to avoid becoming useless
in the case of wrong inferences. For instance, in the
“pounding” example, only one of the hands is in fact
close to the chopping board: the hand which is grasping the meat mallet is about 1 foot from the board.
But, as we have seen above, such errors in positioning
are admissible in the approximate world model framework.
Determining the Onset of Actions
In the examples above, the information from the script
was represented and augmented by simple inference
procedures. However, to use script information it is
necessary to "align" the action frames with the ongoing action; that is, the vision-based system needs to recognize which action is happening at any given moment in time.
For the work presented here we have relied on manual alignment of the action frames to the events in the
scene. All the results shown in this paper use a timed
script, an extension of the script of fig. 4 which includes
information about when each action is happening. This is simplified by the fact that we are using
the simulated version of the SmartCams where the visual data is pre-recorded, enabling manual annotation.
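A minimal sketch of this kind of timed-script lookup, with hypothetical timestamps and the toy frame layout used above, could be:

# Hypothetical timed script: (start_s, end_s, action_frame) triples.
timed_script = [
    (0.0,  42.0, ("speak",  {"actor": "chef"})),
    (42.0, 75.0, ("do",     {"actor": "chef", "action": "mix"})),
    (75.0, 80.0, ("propel", {"actor": "chef", "object": "meat-mallet"})),
]

def current_action(script, t):
    # Return the action frame whose time interval contains t, if any.
    for start, end, frame in script:
        if start <= t < end:
            return frame
    return None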
The problem of visual recognition of actions is also a
current object of our research. The alignment problem
mentioned above can be viewed as a sub-problem of the
general action recognition problem where the order of
the actions is known in advance.
SmartCams in Action
The current version of our SmartCams handles three types of framing (close-ups, medium close shots, medium shots) for a scenario consisting of the chef and about ten objects. All the results obtained so far employ only very simple vision routines similar to "extract-narrowest-moving-blob", based on movement detection by frame differencing.
Figure 6 shows typical framing results obtained by the system. The leftmost column of fig. 6 displays some frames generated in response to the call "close-up chef". The center column of fig. 6 contains another sequence of frames, showing the images provided by the SmartCam tasked to provide "close-up hands". The rightmost column of fig. 6 is the response to a call for a "close-up hands". In this situation, the action "chef pounds the chicken with a meat-mallet" is happening. As shown above, this action determines that the hands must be close to the chopping board. This information is used by the system to initialize expectations for the hands at the beginning of the action (both in terms of position and movement), enabling the tracking system to detect the hands' position based solely on movement information.
One 80-second long video sequence is shown in the
videotape distributed with these proceedings. It is also
available on the Web at:
http://www-white.media.mit.edu/vismod/demos/smartcas/smartcams.html
The web-site also contains another performance of the
same script where the chef is wearing glasses, and the
actions are performed at a faster speed. The sequences
were obtained by requesting the SmartCams to perform specific shots (displayed as subtitles); the cuts between cameras were selected manually. Both sequences
clearly show that acceptable results can be obtained by
our SmartCams in spite of the simplicity of the vision
routines employed.
Conclusion
Approximate world models made the development of
SmartCams feasible. Using the information about actions from the script of the show and the control information in the approximate world model, it has been
possible to employ simple, fast - sometimes unreliable
- vision routines to obtain the information required
for TV framing.
One of the major accomplishments of our research
is the end-to-end implementation of a system able to
deal with multiple levels of information and processing. A SmartCam is able to use contextual information about the world from the text of a TV script, and
to represent the information in a suitable format (approximate world models); updating the world model
is accomplished through visual tracking, and the approximate world models are used in the selection and
control of vision routines, whose outputs control the
movement of a simulated robotic camera. The system
processes real image sequences with a considerable degree of complexity, runs only one order of magnitude slower than real time, and produces an output of good quality in terms of TV standards.

Figure 6: Responses to the calls "close-up chef", "close-up hands", and "close-up hands". Refer to background objects to verify the amount of correction needed to answer those calls appropriately. The grey areas to the right of the last frames of the first "close-up hands" sequence correspond to areas outside of the field of view of the wide-angle image sequence used by the simulator.
Appendix
Inferences in the “Pounding” Example
The following is a manually commented printout of the
action frames generated by the SmartCam’s inference
system, using as input the action frame corresponding
to the sentence "chef pounds the chicken with a meat-mallet". Only the relevant inferences are
shown from about 80 action frames actually generated.
The transitive rule used for the inference of proximity
is sensitive to the size of the objects, avoiding its use
if one of the objects is larger than the others.
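One plausible sketch of such a size-sensitive transitivity check (the size ratio threshold is a hypothetical choice) is:

def transitive_proximity(near_ab, near_bc, size_of, ratio=3.0):
    # Infer proximity(a, c) from proximity(a, b) and proximity(b, c),
    # skipping the inference when the shared object is much larger than
    # the other two (e.g. the table).
    a, b = near_ab
    b2, c = near_bc
    if b != b2:
        return None
    if size_of[b] > ratio * max(size_of[a], size_of[c]):
        return None
    return (a, c)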
0 : action frame obtained from the script
(propel
  (actor chef)
  (object meat-mallet)
  (to (location ?in-contact))
  (from (location ?not-in-contact))
  (result
    (change (object chicken)
            (from (flatness ?X))
            (to (flatness (greater ?X)))))
  (instrument
    (have (object ?in-contact)
          (attribute (phys-cont chicken))))
  (instrument
    (have (object ?not-in-contact)
          (attribute (negation (phys-cont chicken)))))
  (instrument
    (have (object chicken)
          (attribute (phys-cont chopping-board)))))
1 : propelling an object (0) requires grasping
(grasp
  (actor chef)
  (object meat-mallet)
  (to hands))
2 : grasping (1) requires physical-contact
(have
  (object hands)
  (attribute (phys-cont meat-mallet)))
3 : physical-contact (0) implies proximity
(have
(object ?in-contact)
(attribute (proximity chicken)))
4 : physical-contact (0) implies proximity
(have
(object chicken)
(attribute (proximity chopping-board)))
5 : physical-contact (2) implies proximity
(have
(object hands)
(attribute (proximity meat-mallet)))
6 : the end of propelling (0) implies proximity
(have
(object meat-mallet)
(attribute (proximity ?in-contact)))
7 : transitiveness of proximity, (3) and (6)
(have
(object chicken)
(attribute (proximity meat-mallet)))
8 : transitiveness of proximity, (4) and (7)
(have
(object chopping-board)
(attribute (proximity meat-mallet)))
9 : transitiveness of proximity, (5) and (8)
(have
(object chopping-board)
(attribute (proximity hands)))
References
Allen, J. F. 1984. Towards a general theory of action and time. Artificial Intelligence 23:123-154.
Bobick, A. F., and Bolles, R. C. 1992. The representation space paradigm of concurrent evolving object descriptions. IEEE PAMI 14(2):146-156.
Bobick, A., and Pinhanez, C. 1995. Using approximate models as source of contextual information for vision processing. In Proc. of the ICCV'95 Workshop on Context-Based Vision, 13-21.
Draper, B. A.; Collins, R. T.; Brolio, J.; Griffith, J.; Hanson, A. R.; and Riseman, E. M. 1987. Tools and experiments in the knowledge-directed interpretation of road scenes. In Proc. of the DARPA Image Understanding Workshop, 178-193.
Kalita, J. K. 1991. Natural Language Control of Animation of Task Performance in a Physical Domain. Ph.D. Dissertation, University of Pennsylvania, Philadelphia, Pennsylvania.
Marr, D., and Nishihara, H. K. 1978. Representation and recognition of the spatial organization of three-dimensional shapes. In Proc. R. Soc. Lond. B, volume 200, 269-294.
Nagel, H.-H. 1994-5. A vision of 'vision and language' comprises action: An example from road traffic. Artificial Intelligence Review 8:189-214.
Newtson, D.; Engquist, G.; and Bois, J. 1977. The objective basis of behavior units. Journal of Personality and Social Psychology 35(12):847-862.
Rieger III, C. J. 1975. Conceptual memory and inference. In Conceptual Information Processing. North-Holland. Chapter 5, 157-288.
Schank, R. C. 1975. Conceptual dependency theory. In Conceptual Information Processing. North-Holland. Chapter 3, 22-82.
Siskind, J. M. 1994-5. Grounding language in perception. Artificial Intelligence Review 8:371-391.
Strat, T. M., and Fischler, M. A. 1991. Context-based vision: Recognizing objects using information from both 2-d and 3-d imagery. IEEE PAMI 13(10):1050-1065.
Tarr, M. J., and Black, M. J. 1994. A computational and evolutionary perspective of the role of representation in vision. CVGIP: Image Understanding 60(1):65-73.
Thibadeau, R. 1986. Artificial perception of actions.
Cognitive Science 10:117-149.
Tsotsos, J. K. 1985. Knowledge organization and its role in representation and interpretation of time-varying data: The ALVEN system. Computational Intelligence 1:16-32.
Wilks, Y. 1975. A preferential, pattern-seeking semantics for natural language inference. Artificial Intelligence 6(1):53-74.