
From: AAAI Technical Report FS-93-04. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.
Learning about a scene using an active vision system
P. Remagnino, M. Bober and J. Kittler
Department of Electronic and Electrical Engineering,
University of Surrey, Guildford GU2 5XH, United Kingdom
Abstract
An active vision system capable of understanding and learning about a dynamic scene is presented. The system is active since it makes a purposive use of a monocular sensor to satisfy a set of visual tasks. The message conveyed by the paper follows the thesis that learning is indispensable to the vision process to detect expected and unexpected situations, especially when the monitoring of scene dynamics is employed. In such cases fast and efficient learning strategies are required to counterbalance the unsatisfactory performance of standard vision techniques. The paper presents two distinct ways of learning. The system can learn about the geometry and dynamics of the scene in the usual active vision sense, by purposively constructing and continuously updating a model of the scene. This results in an incremental improvement of the performance of the vision process. The system can also learn about new objects, by constructing their models on the basis of recognised object features, and use the models to predict unwanted situations. We suggest that the vision process benefits from the use of techniques for extracting scene characteristics and creating object models. As a consequence of the large variety of existing object classes, a pre-compiled modelling of complex-shaped objects is unrealistic. Moreover, it is difficult to predict the presence and dynamics of all objects which may appear in the scene. Even if the creation of pre-compiled models for complex objects were feasible, the required recognition mechanisms would be slow and presumably inefficient.

1 Introduction

This paper presents the experience gained during the design and experimental evaluation of an active vision system [VAP89] [RJJJ], with the emphasis on its learning abilities. By learning we understand a multilevel process, the goals of which are multifold. At the highest level its aim is to construct a 3D symbolic model of the environment (learning about the scene). At the next level the system learns about the temporal context of scene objects. This information can be used to generate expectations about scene events that will result in increased efficiency of scene modelling. At the lowest level the system is able to learn about new objects. Object description is rendered in terms of geometric primitives, which can be either 3D (generic polyhedra [KYJ93] and cylinders) or 2D (line segments, curves and their relationships), and attributes (colour and motion). We claim that a learning process is indispensable for the system i) to continuously improve its performance and ii) to cope with unexpected situations, for which pre-compiled knowledge is inefficient or non-existent. We present two experiments to show how the system can dynamically construct a symbolic scene model conveying information about the 3D space occupancy, and learn about new objects from the identification of observed image features.

The purpose of the paper is to introduce a system capable of learning the above three aspects by means of a coordinated action of knowledge sources. A system able to solve visual tasks in a partially known environment requires the cooperation of a number of knowledge sources, namely feature extraction, recognition and motion estimation modules. In order to solve a visual task the system performs a number of purposive actions which define its visual behaviour. Actions may involve static explorations of the environment or dynamic monitoring of events occurring in the field of view of the sensors. Usually actions are pre-compiled by encoding the knowledge of an expert about the analysed environment. However, pre-compiled knowledge is not always sufficient to solve a visual task. There is a need for the system to learn about the environment and to respond effectively to dynamic events.

These mechanisms are incorporated in our system (see Figure 1). Learning about the scene is achieved by exploring the environment and building a symbolic model for it. This model is stored in the system database and can be re-used to simplify the execution of new visual tasks. Even more importantly, our system can cope with objects that are unknown. Objects are unknown because their appearance is not predictable at the design stage, or because they cannot be easily modelled due to their complexity.
Constructing a model of the scene is a complex task. The system starts with an exploration of the scene where instances of known object classes are identified. The model of the scene is then built in terms of known objects and their 3D pose. The system may decide to invoke knowledge sources able to detect and segment motion in order to prevent object collisions. Motion is used for detecting and segmenting unknown 3D objects, to which distinguishing image features are assigned. At a later stage, the introduction of unknown objects improves the understanding of the scene and simplifies its maintenance over time. The concept of unknown objects enables the system to see non-modelled objects which would otherwise be invisible.

The next section describes the system structure and its modules. Implementation details are followed by an experimental section, and concluding remarks close the paper.
2 The Framework
The adopted framework uses a mixed control structure. At the highest level of the architecture decisions and related actions are centralised. A central controller, the Supervisor [BHM+93] [PJJJ93], has the capability to decide upon the optimal course of action needed to accomplish user defined visual tasks. Each visual task is decomposed into basic perceptual commands which determine the portion of visual space to be analysed, the sensor parameters and the sources of knowledge to be used. The supervisor adjusts the course of action each time a relevant event occurs in the field of view.
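As an illustration of this decomposition, the sketch below shows what a basic perceptual command might look like as a data structure. The paper describes only the content of a command (viewing region, sensor parameters, knowledge sources to activate); all type and field names here are hypothetical.

```c
/* Hypothetical sketch of a basic perceptual command as issued by the
 * Supervisor.  Field and type names are illustrative only; the paper
 * describes the command's content but not its representation. */
typedef struct {
    double pan, tilt, zoom;        /* sensor parameters                    */
    double roi[4];                 /* portion of visual space (x, y, w, h) */
} SensorSetup;

typedef enum {
    KS_POLYHEDRA_RECOGNITION,      /* polyhedral object recogniser      */
    KS_CYLINDER_RECOGNITION,       /* cylindrical object recogniser     */
    KS_MODEL_BASED_TRACKER,        /* Kalman-filter based tracker       */
    KS_HT_MOTION_SEGMENTATION      /* Hough-transform motion module     */
} KnowledgeSource;

typedef struct {
    SensorSetup     sensor;        /* where and how to look             */
    KnowledgeSource sources[4];    /* which modules to activate         */
    int             n_sources;
} PerceptualCommand;
```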
Figure 1: The architecture of the system
"lhe lowerlevels of the architecture are controlled in a
distributed fashion. Each perceptual command
is executed
and monitoredby separate moduleswhich are responsible
for tight control of the sensor and the run-time parameters
that are used by recognition and motionanalysis processes.
Knowledge
sources that have been employedinclude object
recognition modules,a model-basedobject tracker (which
uses a Kalmanfilter predictor) and an HT(Houghtransform)
based motion extraction module[BK931.
Eachmodulehas a limited specialised knowledgeof the
world whichenables the task execution without any subsequent intervention of the supervisor. For examplethe HT
based tracking modulecan lock on and follow the object
for manyframesafter a single request fromthe supervisor.
Knowledgesources can communicateby accessing a commondatabase as in the blackboard modellEM88]. Figure
1 showsthe architecture of the system. Details concerning
the employedknowledgesources can be found elsewhere
IRem931[BK931[KYJ931[PJ93L
The proposed architecture follows the designs of frameworks of other researchers in the fields of Cognitive Science [AG89], Artificial Intelligence [HHW+89] and Vision [GB90]. The main characteristic of such frameworks is the distinction between perception and cognition. Perceptive and cognitive processes work closely together, first to interpret the sensed data as perceptual information, and then to promote it to conceptual information. Similarly, our work proposes a distinction between the Supervisor module, responsible for the regulation of the vision course of action and its contextual situation assessment, and a number of independent modules which cooperate to interpret the viewed scene. The Supervisor is capable of reasoning; therefore it plays the role of the cognitive module. The knowledge sources responsible for the interpretation of the scene are assembled to form the perceptive module. One of the novelties of the architecture stems from the way visual tasks are encoded at knowledge source level. The supervisor defines a set of purposive actions which establish the viewing parameters; it also establishes which of the recognition and motion analysis modules have to be activated. The existing knowledge sources work in a complementary fashion, since each of them has an independent responsibility (at the moment there are two recognition modules implemented, responsible for the identification of polyhedral and cylindrical objects). This simplifies the incremental construction of the scene, performed by analysing object hypotheses created by the knowledge sources and combining new hypotheses with the already established ones.
Interpretation is identified with a cycle of perception which works at frame rate and includes the scene imaging and its 3D geometric understanding. Reasoning is slower, intrinsically asynchronous and driven either by events occurring in the field of view or by user defined tasks. It entails a continuous adjustment of the course of action depending on the detected event or the established interpretation.
Reasoning is represented as a heterarchical structure of tasks, where each user defined task is an independent root task and the terminal tasks identify elementary visual operations. Reasoning evolves inside the supervisor in three major phases: i) a task handling phase, responsible for task decomposition, ii) a scheduling phase, dedicated to task activation, and iii) a response phase, in charge of the situation assessment.
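As an illustration only (the paper gives no pseudo-code, and the actual supervisor is a set of CLIPS rules rather than a C loop), the three phases could be arranged as a cycle along these lines, with hypothetical stub functions:

```c
#include <stdio.h>

/* Hypothetical sketch of the supervisor's three-phase reasoning cycle.
 * The stubs below only print what each phase would be responsible for. */
static void handle_tasks(void)     { puts("decompose user tasks into subtasks"); }
static void schedule_tasks(void)   { puts("activate the runnable terminal tasks"); }
static void assess_situation(void) { puts("assess detected events, adjust course"); }

int main(void)
{
    for (int cycle = 0; cycle < 3; ++cycle) {  /* bounded for the sketch */
        handle_tasks();       /* i)   task handling: decomposition       */
        schedule_tasks();     /* ii)  scheduling: task activation        */
        assess_situation();   /* iii) response: situation assessment     */
    }
    return 0;
}
```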
Knowledge about the world is represented in terms of geometric relations and symbolic concepts which describe the scene in terms of object classes and their attributes. Both symbolic and geometric representations are hierarchical. Geometric knowledge describes the scene in terms of a network of reference frames, identified with instances of reference objects which best represent the position of neighbouring objects. Symbolic knowledge describes the scene in terms of object classes. Classes are defined in terms of the functionality of objects, their attributes and their typical role in a scene.
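A minimal sketch of these two representations is given below; the struct and field names are invented for illustration and do not appear in the paper.

```c
/* Hypothetical sketch of the two knowledge representations described
 * above.  All names and fields are illustrative, not from the paper. */
typedef struct RefFrame {
    const char      *reference_object;  /* e.g. "table"                 */
    double           pose[6];           /* 3D pose w.r.t. parent frame  */
    struct RefFrame *neighbours[8];     /* network of reference frames  */
    int              n_neighbours;
} RefFrame;                             /* geometric knowledge          */

typedef struct {
    const char *name;                   /* e.g. "cup"                   */
    const char *functionality;          /* typical role in a scene      */
    double      mobility_factor;        /* see the decay model below    */
} ObjectClass;                          /* symbolic knowledge           */
```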
Scene dynamics plays an important role in our system. The object mobility function, reflecting the likelihood that a particular instance of an object is still present in the scene, is modelled as an exponential function which decreases with time. A mobility factor is defined for each object to adjust the speed with which the exponential function decreases with time. The mobility factor is object and scene class dependent, and the Supervisor may decide upon a different course of action according to the expected mobility of the objects known to be present in the analysed scene.
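The paper does not give the mobility function explicitly; a minimal formulation consistent with the description (exponential decay at a rate set by a per-object mobility factor) might be:

```latex
% Hypothetical formulation; only the exponential decay and the
% per-object mobility factor are stated in the paper.
s_i(t) = \exp\bigl(-\mu_i \,(t - t_i)\bigr)
```

where s_i(t) denotes the support for object instance i at time t, t_i the time at which it was last observed, and mu_i >= 0 its mobility factor. A low mu_i (as assigned to the plate and the cup in the first experiment) yields a long decay, so the instance remains supported in the database while out of view; presumably an instance is removed once its support falls below some threshold.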
The system is never idle. In case no user goal has been defined, an initial exploration of the scene is performed by the Supervisor. Exploration is performed by directing the monocular sensor towards known reference frames and exploring their surroundings. The supervisor has prior knowledge of the object classes which are typically found in the neighbourhood of reference objects. In this way a purposive course of action can be engaged by looking for instances of such object classes. This generally involves the invocation of the most suitable knowledge sources and the selection of a restricted field of view to focus attention on particular portions of 3D space. Interpretation is typically guided to search for known objects in portions of space where the knowledge about the scene is scarce. The sensor is directed from one reference frame to another. Searching, monitoring and all visual operations which involve geometric computations are performed locally to the active reference frame. This allows a considerable reduction of positioning errors, since each movement from one reference frame to another can be followed by a re-calibration of the sensing device onto the local reference object. Besides, the use of a network of frames allows a more efficient exploration of the scene, with a consequent improvement in the performance of the vision process.
The learning capabilities which have been built in differ in nature. On the one side, learning about the scene is accomplished in terms of known concepts. Currently the system has prior knowledge of a number of specific object classes, including their metric information; it can recognise generic cylinders and box-like structures; it understands motion and the concept of colour; and it can characterise objects by means of simple geometric primitives (straight lines, arcs and ellipses) and their groupings [BHM+93]. On the other side, learning is also accomplished by building new models for objects which are moving in the field of view.
3 Implementation details
The proposed architecture requires efficient processing algorithms at different levels of abstraction. At the lowest and intermediate levels, the implemented routines must process the sensed image and extract salient features. At the highest level, strategies are used to construct efficient reasoning mechanisms able to satisfy a set of visual goals and regulate their execution. The requirements are different at each level: image processing and feature extraction need a fast and flexible implementation, while reasoning calls for a more elaborate implementation able to cope with an asynchronous information flow and event-driven control flow. For this purpose image processing, feature extraction and motion analysis routines have been implemented in standard C, while the supervisor and the control knowledge have been implemented in CLIPS [GR89]. CLIPS, which stands for C Language Integrated Production System, is a rule based system implemented in the C language. Consequently, CLIPS is fast and allows for simple and efficient interfacing with low-level C routines. CLIPS inherits all the characteristics of a rule based shell. Rules can be grouped to form efficient and self-contained software modules. Moreover, CLIPS version 5.1 has object-oriented facilities: object classes, instances and inheritance can be defined and generated at run-time. The present solution may not be the most efficient, but it is a good starting point which proved to work in the presented system.
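As a hedged illustration of this C/CLIPS interfacing (the paper shows no code), a low-level routine might hand an event to the rule base along the following lines, assuming the classic embedded CLIPS API of that era (InitializeCLIPS, Load, Reset, AssertString, Run); later CLIPS versions renamed these calls, and the fact format and function names below are invented for the example.

```c
#include <stdio.h>
#include "clips.h"   /* classic embedded CLIPS header */

/* Hypothetical sketch: a low-level C vision routine asserting a motion
 * event as an ordered fact, so supervisor rules can react to it. */
void report_motion_event(double x, double y, double vx, double vy)
{
    char fact[128];
    sprintf(fact, "(motion-event %f %f %f %f)", x, y, vx, vy);
    AssertString(fact);   /* place the event in the fact base    */
    Run(-1L);             /* fire rules until the agenda empties */
}

int main(void)
{
    InitializeCLIPS();
    Load("supervisor.clp");   /* hypothetical rule file */
    Reset();
    Run(-1L);
    return 0;
}
```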
4 Experimental work
Both experiments that we are going to show have been carried out on real data. They emphasize the learning skills of our system.
The first experiment illustrates learning over time at scene level. Figure 2 (Learning about a scene) presents a table top with two cups, a box and a plate. The system starts to perform a standard exploration of the table top and identifies only one of the two cups and the plate (the first figure corresponds to Frame 2). The second figure shows a later frame (Frame 4) where the visual behaviour has been changed to focus attention onto the identified cup. When focus of attention is activated the system performance is considerably improved, since the employed knowledge sources only work on the region of interest set by the supervisor. The second figure shows that the plate instance is still supported, since its mobility has been set to low values (this results in a long exponential decay). The third figure shows Frame 10. Focus of attention has been changed to the plate, which in the meantime has been slightly moved. Due to its low mobility, the cup instance is still supported in the database even if no longer visible in the field of view. Frame 16 is shown in the last figure. The cup instance has been removed from the database. The figure shows that the plate has been successfully tracked by the interpretation module.
The purpose of the second experiment is to illustrate how the system can learn about the environment and cope with unknown objects. The sequence in Figure 2 (Learning about objects) shows a table with a stationary Coke can and two moving mechanical toys (bugs). The initial goal of the supervisor is to explore the scene and learn about its contents. During this phase the Coke can is recognised by the knowledge source dedicated to performing the recognition of cylindrical objects. This is achieved by reasoning about the findings of low level routines such as edge detection, and line and ellipse finders [PJ93]. Once the Coke can has been recognised it is entered and maintained in the 3D scene model database. When motion is detected in the scene the supervisor switches its visual goal so as to check for possible collisions of moving objects with the can. To achieve this the supervisor activates the knowledge sources responsible for 3D motion modelling. Two moving regions are detected and their velocities and bounding boxes are determined (see Figure 2).
There is no model of the bugs in the database, thus the moving objects cannot be recognised. At this stage the supervisor uses its capability to learn about the objects. This is accomplished by collecting a set of features detected in the regions of interest. In the current implementation one of the features used to characterise a new object is its prominent colour. Other features, including shape and size, are also used. We believe that the process of object description could be further improved if more features were employed. The moving objects (the bugs) are now labelled and the corresponding features are placed in the database. This implies that they are now treated by the system as any other object that has been modelled. Our system has therefore learned to recognise two new objects, although at this stage it is unable to interpret them semantically.
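As a purely illustrative sketch, the record stored for such a newly learned object might collect the features mentioned above; none of the names below appear in the paper.

```c
/* Hypothetical record for a newly learned (unknown) object, built from
 * features collected in its region of interest; illustrative only. */
typedef struct {
    int    label;            /* identifier assigned by the system      */
    double prominent_hue;    /* dominant colour of the region          */
    double size;             /* region size estimate                   */
    double bbox[4];          /* bounding box from motion segmentation  */
    double velocity[2];      /* image-plane velocity                   */
    int    dangerous;        /* set after an observed collision        */
} LearnedObject;
```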
The supervisor follows the moving objects and attempts to determine how likely a collision with the Coke can is. The acquired 3D knowledge of the scene is used to map the 2D motion into 3D motion of the bugs on the table. The supervisor judges the danger of a collision on the basis of the objects' 3D positions and their directions of motion. In this particular experiment the danger is detected by comparing the angle between the collision route of a moving object and its actual motion direction against a predefined threshold. A white frame around the bug moving towards the can of Coke signals the danger of collision. The other bug is still monitored in case it changes its direction of motion. When the collision takes place the supervisor marks the colliding object (in our case the bug is identified by its colour) as potentially dangerous.
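A minimal sketch of such an angle test follows, assuming positions are expressed in the table plane; the threshold value and all names are illustrative, as the paper reports only the comparison itself.

```c
#include <math.h>

/* Hypothetical sketch of the collision test described above: flag danger
 * when the angle between the collision route (direction from the moving
 * object to the can) and the object's actual motion direction is small. */
int collision_danger(double ox, double oy,   /* object position        */
                     double vx, double vy,   /* object velocity        */
                     double cx, double cy,   /* can position           */
                     double max_angle_rad)   /* threshold, e.g. 0.2    */
{
    double dx = cx - ox, dy = cy - oy;       /* collision route        */
    double norm = hypot(dx, dy) * hypot(vx, vy);
    if (norm == 0.0)
        return 0;                            /* stationary: no danger  */
    double c = (dx * vx + dy * vy) / norm;   /* cosine of the angle    */
    if (c > 1.0)  c = 1.0;                   /* clamp rounding error   */
    if (c < -1.0) c = -1.0;
    return acos(c) < max_angle_rad;          /* small angle => danger  */
}
```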
5 Conclusion and Future Work
We have presented an active vision system able to learn about the scene content and make use of its findings to cope with unexpected situations. The system is characterised by a distributed control structure where a set of knowledge sources is coordinated by a supervisor to solve given visual tasks. The system is never idle: if no event occurs, the field of view is explored and the knowledge about the scene improved. If an object of interest is detected, the supervisor may decide upon a course of action focused on the monitoring of eventual collisions. We have presented a technique for which learning turns out to be indispensable to achieving correct and reliable monitoring of the scene. We have shown the results of two of the many experiments conducted to demonstrate our technique and the importance of learning in a surveillance situation. Although the framework has proved valuable for solving visual tasks, its learning capabilities are still at a prototype level. Our intention is to augment its robustness and efficiency.
References
[AG89] V.V. Alexandrov and N.D. Gorsky. Expert systems simulating human visual perception. International Journal of Pattern Recognition and Artificial Intelligence, 3(1):19-28, 1989.

[BHM+93] M. Bober, P. Hoad, J. Matas, P. Remagnino, J. Kittler, and J. Illingworth. Control of perception in an active vision system: Sensing and interpretation. In Workshop on Intelligent Robotic Systems 93, pages 258-276, Zakopane, Poland, July 1993.

[BK93] M. Bober and J. Kittler. A Hough transform based hierarchical algorithm for motion segmentation and estimation. In V. Cappellini, editor, Proceedings of the 4th International Workshop on Time-Varying Image Processing and Moving Object Recognition, Firenze, June 1993. Elsevier.

[EM88] Robert Engelmore and Tony Morgan. Blackboard Systems. The Insight Series of Artificial Intelligence. Addison-Wesley, 1988.

[GB90] S. Gong and H. Buxton. Control of purposive smart vision. Technical Report 561, University of London, July 1990.

[GR89] J. Giarratano and G. Riley. Expert Systems: Principles and Programming. PWS-KENT Series in Computer Science. PWS-KENT, Boston, 1989.

[HHW+89] B. Hayes-Roth, M. Hewett, R. Washington, R. Hewett, and A. Seiver. Distributing intelligence within an individual. In Les Gasser and Michael N. Huhns, editors, Distributed Artificial Intelligence, chapter 15, pages 385-412. Pitman, London, 1989.

[KYJ93] K.C. Wong, Y. Cheng, and J. Kittler. Recognition of polyhedral objects using triangle pair features. IEE Proceedings Part I: Communications, Speech and Vision, volume 140, pages 72-85, February 1993.

[PJ93] P. Hoad and J. Illingworth. Recognition of 3D cylinders in 2D images by top-down model imposition. In Proceedings of the Scandinavian Conference on Image Analysis, volume II, pages 1137-1144, Tromso, Norway, May 25-28 1993.

[PJJJ93] P. Remagnino, J. Matas, J. Illingworth, and J. Kittler. A scene interpretation module for an active vision system. In SPIE Conference on Intelligent Robots and Computer Vision XII: Active Vision and 3D Methods, September 1993.

[Rem93] Paolo Remagnino. Control Issues in High Level Vision. PhD thesis, University of Surrey, Guildford, UK, July 1993.

[RJJJ] P. Remagnino, J. Illingworth, J. Kittler, and J. Matas. Intentional control of camera look direction and viewpoint in an active vision system. In Vision as Process, ESPRIT Basic Research Action series. Springer-Verlag, to appear.

[VAP89] Vision as Process. Technical Annex, ESPRIT BRA 3038, European Commission, 1989.
[Figure 2: two image sequences, "Learning about a scene" and "Learning about objects". The object-learning panels show frames 3, 15 and 25, the moving and recognised objects in each of those frames, and the corresponding motion segmentations.]
Figure 2: Learning about a scene and moving objects