From: AAAI Technical Report FS-93-04. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Learning about a scene using an active vision system

P. Remagnino, M. Bober and J. Kittler
Department of Electronic and Electrical Engineering, University of Surrey, Guildford GU2 5XH, United Kingdom

Abstract

An active vision system capable of understanding and learning about a dynamic scene is presented. The system is active since it makes purposive use of a monocular sensor to satisfy a set of visual tasks. The message conveyed by the paper follows the thesis that learning is indispensable to the vision process for detecting expected and unexpected situations, especially when the monitoring of scene dynamics is employed. In such cases fast and efficient learning strategies are required to counterbalance the unsatisfactory performance of standard vision techniques. The paper presents two distinct ways of learning. The system can learn about the geometry and dynamics of the scene in the usual active vision sense, by purposively constructing and continuously updating a model of the scene; this results in an incremental improvement of the performance of the vision process. The system can also learn about new objects, by constructing their models on the basis of recognised object features, and can use these models to predict unwanted situations.
1 Introduction

This paper presents the experience gained during the design and experimental evaluation of an active vision system [VAP89] [RJJJ], with the emphasis on its learning abilities. By learning we understand a multilevel process, the goals of which are manifold. At the highest level its aim is to construct a 3D symbolic model of the environment (learning about the scene). At the next level the system learns about the temporal context of scene objects; this information can be used to generate expectations about scene events that result in increased efficiency of scene modelling. At the lowest level the system is able to learn about new objects. Object description is rendered in terms of geometric primitives, which can be either 3D (generic polyhedra [KYJ93] and cylinders) or 2D (line segments, curves and their relationships), and attributes (colour and motion). We claim that a learning process is indispensable for the system i) to continuously improve its performance and ii) to cope with unexpected situations, for which pre-compiled knowledge is inefficient or non-existent. We present two experiments to show how the system can dynamically construct a symbolic scene model conveying information about the 3D space occupancy, and learn about new objects from the identification of observed image features. The purpose of the paper is to introduce a system capable of learning the above three aspects by means of a coordinated action of knowledge sources.

A system able to solve visual tasks in a partially known environment requires the cooperation of a number of knowledge sources, namely feature extraction, recognition and motion estimation modules. In order to solve a visual task the system performs a number of purposive actions which define its visual behaviour. Actions may involve static explorations of the environment or dynamic monitoring of events occurring in the field of view of the sensors. Usually actions are pre-compiled by encoding the knowledge of an expert about the analysed environment. However, pre-compiled knowledge is not always sufficient to solve a visual task; there is a need for the system to learn about the environment and to respond effectively to dynamic events. These mechanisms are incorporated in our system (see Figure 1). Learning about the scene is achieved by exploring the environment and building a symbolic model for it. This model is stored in the system database and can be re-used to simplify the execution of new visual tasks.

Even more importantly, our system can cope with objects that are unknown, either because their appearance is not predictable at the design stage or because they cannot be easily modelled due to their complexity. We suggest that the vision process benefits from the use of techniques for extracting scene characteristics and creating object models. As a consequence of the large variety of existing object classes, a pre-compiled modelling of complex-shaped objects is unrealistic. Moreover, it is difficult to predict the presence and dynamics of all objects which may appear in the scene. Even if the creation of pre-compiled models for complex objects were feasible, the required recognition mechanisms would be slow and presumably inefficient.

Constructing a model of the scene is a complex task. The system starts with an exploration of the scene where instances of known object classes are identified. The model of the scene is then built in terms of known objects and their 3D pose. The system may decide to invoke knowledge sources able to detect and segment motion in order to prevent object collisions. Motion is used for detecting and segmenting unknown 3D objects, to which distinguishing image features are assigned. At a later stage, the introduction of unknown objects improves the understanding of the scene and simplifies its maintenance over time. The concept of unknown objects enables the system to see non-modelled objects which would otherwise be invisible.

The next section describes the system structure and its modules. Implementation details are followed by an experimental section, and concluding remarks close the paper.

2 The Framework

The adopted framework uses a mixed control structure. At the highest level of the architecture, decisions and the related actions are centralised. A central controller, the Supervisor [BHM+93] [PJJJ93], has the capability to decide upon the optimal course of action needed to accomplish user-defined visual tasks. Each visual task is decomposed into basic perceptual commands which determine the portion of visual space to be analysed, the sensor parameters and the sources of knowledge to be used. The supervisor adjusts the course of action each time a relevant event occurs in the field of view.

Figure 1: The architecture of the system (the Supervisor coordinating the knowledge sources and the sensor).

The lower levels of the architecture are controlled in a distributed fashion. Each perceptual command is executed and monitored by separate modules which are responsible for tight control of the sensor and of the run-time parameters used by the recognition and motion analysis processes. Knowledge sources that have been employed include object recognition modules, a model-based object tracker (which uses a Kalman filter predictor) and an HT (Hough transform) based motion extraction module [BK93]. Each module has a limited, specialised knowledge of the world which enables task execution without any subsequent intervention of the supervisor. For example, the HT-based tracking module can lock on and follow an object for many frames after a single request from the supervisor.
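To make the decomposition into perceptual commands concrete, the sketch below shows one plausible C encoding of such a command: the region of visual space to analyse, the sensor parameters and the knowledge sources to activate. The structure, field names and knowledge-source list are our own hypothetical choices for illustration, not the paper's actual interface.

```c
#include <stdio.h>

/* Hypothetical encoding of a basic perceptual command. Names and
 * fields are illustrative assumptions only. */
typedef enum {
    KS_POLYHEDRA_RECOGNISER,  /* box-like structures             */
    KS_CYLINDER_RECOGNISER,   /* cans, cups, ...                 */
    KS_MODEL_BASED_TRACKER,   /* Kalman-filter based tracker     */
    KS_HT_MOTION_SEGMENTER    /* Hough-transform motion module   */
} KnowledgeSource;

typedef struct {
    double pan, tilt, zoom;       /* sensor (camera head) parameters    */
    int    roi_x, roi_y,          /* restricted field of view (region   */
           roi_w, roi_h;          /* of interest) in image coordinates  */
    KnowledgeSource sources[4];   /* knowledge sources to be invoked    */
    int    n_sources;
} PerceptualCommand;

/* The supervisor would dispatch a command once and let the addressed
 * modules run autonomously until completion or a relevant event. */
static void dispatch(const PerceptualCommand *cmd)
{
    printf("look at (%.1f, %.1f), zoom %.1f; ROI %dx%d; %d source(s)\n",
           cmd->pan, cmd->tilt, cmd->zoom,
           cmd->roi_w, cmd->roi_h, cmd->n_sources);
}

int main(void)
{
    PerceptualCommand cmd = {
        .pan = 10.0, .tilt = -5.0, .zoom = 2.0,
        .roi_x = 64, .roi_y = 48, .roi_w = 128, .roi_h = 96,
        .sources = { KS_CYLINDER_RECOGNISER }, .n_sources = 1
    };
    dispatch(&cmd);
    return 0;
}
```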
Knowledge sources communicate by accessing a common database, as in the blackboard model [EM88]. Figure 1 shows the architecture of the system. Details concerning the employed knowledge sources can be found elsewhere [Rem93] [BK93] [KYJ93] [PJ93].

The proposed architecture follows the designs of frameworks of other researchers in the fields of Cognitive Science [AG89], Artificial Intelligence [HHW+89] and Vision [GB90]. The main characteristic of such frameworks is the distinction between perception and cognition: perceptive and cognitive processes work closely, first to interpret the sensed data as perceptual information and then to promote it to conceptual information. Similarly, our work proposes a distinction between the Supervisor module, responsible for the regulation of the vision course of action and for contextual situation assessment, and a number of independent modules which cooperate to interpret the viewed scene. The Supervisor is capable of reasoning and therefore plays the role of the cognitive module; the knowledge sources responsible for the interpretation of the scene are assembled to form the perceptive module. One of the novelties of the architecture stems from the way visual tasks are encoded at the knowledge source level. The supervisor defines a set of purposive actions which establish the viewing parameters; it also establishes which of the recognition and motion analysis modules have to be activated. The existing knowledge sources work in a complementary fashion, since each of them has an independent responsibility (at the moment two recognition modules are implemented, responsible for the identification of polyhedral and cylindrical objects respectively). This simplifies the incremental construction of the scene model, performed by analysing object hypotheses created by the knowledge sources and combining new hypotheses with the already established ones.

Interpretation is identified with a cycle of perception which works at frame rate and includes the scene imaging and its 3D geometric understanding. Reasoning is slower, intrinsically asynchronous, and driven either by events occurring in the field of view or by user-defined tasks. It entails a continuous adjustment of the course of action depending on the detected event or the established interpretation. Reasoning is represented as a heterarchical structure of tasks, where each user-defined task is an independent root task and the terminal tasks identify elementary visual operations. Reasoning evolves inside the supervisor in three major phases: i) a task handling phase, responsible for task decomposition; ii) a scheduling phase, dedicated to task activation; and iii) a response phase, in charge of the situation assessment.

Knowledge about the world is represented in terms of geometric relations and symbolic concepts which describe the scene in terms of object classes and their attributes. Both the symbolic and the geometric representations are hierarchical. Geometric knowledge describes the scene in terms of a network of reference frames, identified with instances of reference objects which best represent the position of neighbouring objects. Symbolic knowledge describes the scene in terms of object classes, defined in terms of the functionality of objects, their attributes and their typical role in a scene.

Scene dynamics plays an important role in our system. The object mobility function, reflecting the likelihood that a particular object instance is still present in the scene, is modelled as an exponential function which decreases with time. A mobility factor is defined for each object to adjust the speed with which the exponential function decays. The mobility factor is object and scene class dependent, and the Supervisor may decide upon a different course of action according to the expected mobility of the objects known to be present in the analysed scene.
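A minimal sketch of how such a mobility function could be realised is given below, assuming a support of the form exp(-m·Δt), where m is the per-class mobility factor and Δt the time since the instance was last observed. The function names, the factor values and the removal threshold are our illustrative assumptions, not the system's actual parameters; the low-mobility plate versus higher-mobility cup mirrors the first experiment described later.

```c
#include <math.h>
#include <stdio.h>

/* Assumed mobility factors: low for near-static objects (plate),
 * higher for objects that are easily moved (cup). */
#define MOBILITY_PLATE 0.05   /* slow decay: long support   */
#define MOBILITY_CUP   0.30   /* faster decay               */

#define SUPPORT_THRESHOLD 0.1 /* below this, drop the instance */

/* Support for an object instance decays exponentially with the time
 * elapsed since the instance was last confirmed by observation. */
static double instance_support(double mobility, double dt_seconds)
{
    return exp(-mobility * dt_seconds);
}

int main(void)
{
    for (double t = 0.0; t <= 30.0; t += 10.0) {
        double cup   = instance_support(MOBILITY_CUP,   t);
        double plate = instance_support(MOBILITY_PLATE, t);
        printf("t=%4.0fs  cup=%.3f%s  plate=%.3f%s\n",
               t, cup,   cup   < SUPPORT_THRESHOLD ? " (removed)" : "",
               plate, plate < SUPPORT_THRESHOLD ? " (removed)" : "");
    }
    return 0;
}
```

With these assumed values the cup instance falls below the threshold after a few seconds out of view, while the plate instance remains supported much longer, which is the behaviour the mobility factor is meant to control.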
The system is never idle. In case no user goal has been defined, an initial exploration of the scene is performed by the Supervisor. Exploration is performed by directing the monocular sensor towards known reference frames and exploring their surroundings. The supervisor has prior knowledge of the object classes which are typically found in the neighbourhood of reference objects. In this way a purposive course of action can be engaged by looking for instances of such object classes. This generally involves the invocation of the most suitable knowledge sources and the selection of a restricted field of view to focus attention on particular portions of 3D space. Interpretation is typically guided to search for known objects in portions of space where the knowledge about the scene is scarce. The sensor is directed from one reference frame to another. Searching, monitoring and all visual operations which involve geometric computations are performed locally to the active reference frame. This allows a considerable reduction of positioning errors, since each movement from one reference to another can be followed by a re-calibration of the sensing device onto the local reference object. Moreover, the use of a network of frames allows a more efficient exploration of the scene, with a consequent improvement of the performance of the vision process.

The learning capabilities which have been built in are of a different nature. On one side, learning about the scene is accomplished in terms of known concepts. Currently the system has prior knowledge of a number of specific object classes, including their metric information; it can recognise generic cylinders and box-like structures; it understands motion and the concept of colour; and it can characterise objects by means of simple geometric primitives (straight lines, arcs and ellipses) and their groupings [BHM+93]. On the other side, learning is also accomplished by building new models for objects which are moving in the field of view.

3 Implementation details

The proposed architecture requires efficient processing algorithms at different levels of abstraction. At the lowest and intermediate levels, the implementation routines must process the sensed image and extract salient features. At the highest level, strategies are used to construct efficient reasoning mechanisms able to satisfy a set of visual goals and regulate their execution. The requirements differ at each level: image processing and feature extraction need a fast and flexible implementation, while reasoning calls for a more elaborate implementation able to cope with asynchronous information flow and event-driven control flow. For this purpose image processing, feature extraction and motion analysis routines have been implemented in standard C, while the supervisor and the control knowledge have been implemented in CLIPS [GR89]. CLIPS, which stands for C Language Integrated Production System, is a rule-based system implemented in the C language. Consequently, CLIPS is fast and allows simple and efficient interfacing with low-level C routines. CLIPS inherits all the characteristics of a rule-based shell: rules can be grouped to form efficient and self-contained software modules. Moreover, CLIPS version 5.1 has object-oriented facilities, so object classes, instances and inheritance can be defined and generated at run-time.
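The coupling between the low-level C routines and the rule-based supervisor could look roughly like the sketch below, in which a C module asserts a detected event as a fact for the supervisor's rules to react to. The function names follow the publicly documented embedding API of later CLIPS 6.x releases (InitializeEnvironment, Load, Reset, AssertString, Run); the CLIPS 5.1 calls used in the actual system may differ, and supervisor.clp and the fact format are our hypothetical examples.

```c
#include <stdio.h>
#include "clips.h"   /* CLIPS embedding header */

/* A low-level C module reports a motion event to the rule-based
 * supervisor by asserting a fact; the supervisor's rules then decide
 * the next course of action. Fact format is illustrative only. */
static void report_motion_event(int region_id, double vx, double vy)
{
    char fact[128];
    sprintf(fact, "(motion-event (region %d) (velocity %f %f))",
            region_id, vx, vy);
    AssertString(fact);
}

int main(void)
{
    InitializeEnvironment();       /* start the inference engine      */
    Load("supervisor.clp");        /* hypothetical control knowledge  */
    Reset();

    report_motion_event(1, 0.8, -0.2);  /* e.g. from the HT module    */
    Run(-1L);                           /* fire rules until quiescence */
    return 0;
}
```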
The present solution may not be the most efficient, but it is a good starting point which has proved to work in the presented system.

4 Experimental work

Both experiments that we are going to show have been carried out with real data. They emphasise the learning skills of our system.

The first experiment illustrates learning over time at the scene level. Figure 2 (Learning about a scene) presents a table top with two cups, a box and a plate. The system starts by performing a standard exploration of the table top and identifies only one of the two cups and the plate (the first figure corresponds to Frame 2). The second figure shows a later frame (Frame 4), where the visual behaviour has been changed to focus attention onto the identified cup. When focus of attention is activated, the system performance is considerably improved, since the employed knowledge sources only work on the region of interest set by the supervisor. The second figure also shows that the plate instance is still supported, since its mobility has been set to a low value (this results in a long exponential decay). The third figure shows Frame 10. Focus of attention has been changed to the plate, which in the meantime has been slightly moved. Due to its low mobility, the cup instance is still supported in the database even though it is no longer visible in the field of view. Frame 16 is shown in the last figure: the cup instance has now been removed from the database. The figure also shows that the plate has been successfully tracked by the interpretation module.

The purpose of the second experiment is to illustrate how the system can learn about the environment and cope with unknown objects. The sequence in Figure 2 (Learning about objects) shows a table with a stationary Coke can and two moving mechanical toys (bugs). The initial goal of the supervisor is to explore the scene and learn about its contents. During this phase the Coke can is recognised by the knowledge source dedicated to the recognition of cylindrical objects. This is achieved by reasoning about the findings of low-level routines such as edge detection and line and ellipse finders [PJ93]. Once the Coke can has been recognised, it is entered and maintained in the 3D scene model database. When motion is detected in the scene, the supervisor switches its visual goal so as to check for possible collisions of moving objects with the can. To achieve this, the supervisor activates the knowledge sources responsible for 3D motion modelling. Two moving regions are detected and their velocities and bounding boxes are determined (see Figure 2). There is no model of the bugs in the database, so the moving objects cannot be recognised. At this stage the supervisor uses its capability to learn about the objects. This is accomplished by collecting a set of features detected in the regions of interest. In the current implementation one of the features used to characterise a new object is its prominent colour; other features, including shape and size, are also used. We believe that the process of object description could be further improved if more features were employed. The moving objects (the bugs) are now labelled and the corresponding features are placed in the database. This implies that they are now treated by the system as any other object that has been modelled. Our system has therefore learned to recognise two new objects, although at this stage it is unable to interpret them semantically.
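As a rough illustration of this feature-collection step, the sketch below builds a model for an unknown moving region from its prominent colour and bounding-box size, and matches later detections against the stored models. The data layout, the labelling scheme and the matching tolerances are our illustrative assumptions, not the system's actual representation.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical model of a newly learned object: prominent colour plus
 * bounding-box size, as in the experiment with the bugs. */
typedef struct {
    char name[16];          /* system-assigned label, e.g. "object-1" */
    unsigned char r, g, b;  /* prominent colour of the region         */
    int w, h;               /* bounding-box size in pixels            */
} ObjectModel;

static ObjectModel db[32];
static int db_size = 0;

/* Label an unknown moving region and store its features. */
static const ObjectModel *learn_object(unsigned char r, unsigned char g,
                                       unsigned char b, int w, int h)
{
    ObjectModel *m = &db[db_size];
    sprintf(m->name, "object-%d", ++db_size);
    m->r = r; m->g = g; m->b = b; m->w = w; m->h = h;
    return m;
}

/* Match a detection against the database: nearest colour within an
 * assumed tolerance, plus a loose check on bounding-box size. */
static const ObjectModel *match_object(unsigned char r, unsigned char g,
                                       unsigned char b, int w, int h)
{
    for (int i = 0; i < db_size; i++) {
        int dc = abs(db[i].r - r) + abs(db[i].g - g) + abs(db[i].b - b);
        if (dc < 60 && abs(db[i].w - w) < 20 && abs(db[i].h - h) < 20)
            return &db[i];
    }
    return NULL;   /* unknown: trigger learning instead */
}

int main(void)
{
    learn_object(200, 40, 40, 50, 30);   /* first bug, reddish   */
    learn_object(40, 180, 60, 55, 32);   /* second bug, greenish */
    const ObjectModel *m = match_object(195, 45, 38, 52, 31);
    printf("matched: %s\n", m ? m->name : "none");
    return 0;
}
```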
The supervisor follows the moving objects and attempts to determine how likely a collision with the Coke can is. The acquired 3D knowledge of the scene is used to map the 2D motion into the 3D motion of the bugs on the table. The supervisor judges the danger of a collision on the basis of the objects' 3D positions and directions of motion. In this particular experiment the danger is detected by comparing the angle between the collision route of a moving object and its actual motion direction against a predefined threshold. A white frame around the bug moving towards the can of Coke signals the danger of collision. The other bug is still monitored in case it changes its direction of motion. When the collision takes place, the supervisor marks the colliding object (in our case the bug is identified by its colour) as potentially dangerous.
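A minimal sketch of this angle test is given below, assuming the collision route is the direction from the moving object to the can on the table plane. The threshold value and the function names are our own illustrative choices; the paper does not give the actual threshold.

```c
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Assumed threshold on the angle between the collision route and the
 * actual motion direction (radians); value is illustrative. */
#define COLLISION_ANGLE_THRESHOLD 0.2

/* Danger test on the table plane: compare the direction of motion of
 * a moving object with the direction from the object to the can. */
static int collision_danger(double px, double py,   /* object position */
                            double vx, double vy,   /* object velocity */
                            double cx, double cy)   /* can position    */
{
    double route_angle  = atan2(cy - py, cx - px);  /* collision route */
    double motion_angle = atan2(vy, vx);
    double diff = fabs(route_angle - motion_angle);
    if (diff > M_PI) diff = 2.0 * M_PI - diff;      /* wrap to [0, pi] */
    return diff < COLLISION_ANGLE_THRESHOLD;
}

int main(void)
{
    /* Bug heading almost straight at the can: danger flagged. */
    printf("bug 1: %s\n",
           collision_danger(0, 0, 1.0, 0.05, 5, 0) ? "danger" : "safe");
    /* Bug moving away from the can: monitored but safe. */
    printf("bug 2: %s\n",
           collision_danger(0, 2, -1.0, 0.5, 5, 0) ? "danger" : "safe");
    return 0;
}
```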
5 Conclusion and Future Work

We have presented an active vision system able to learn about the scene content and make use of its findings to cope with unexpected situations. The system is characterised by a distributed control structure in which a set of knowledge sources is coordinated by a supervisor to solve given visual tasks. The system is never idle: if no event occurs, the field of view is explored and the knowledge about the scene improved. If an object of interest is detected, the supervisor may decide upon a course of action focused on the monitoring of eventual collisions. We have presented a technique for which learning turns out to be indispensable to achieving correct and reliable monitoring of the scene, and we have shown the results of two of the many experiments conducted to demonstrate our technique and the importance of learning in a surveillance situation. Although the framework has proved valuable for solving visual tasks, its learning capabilities are still at a prototype level. Our intention is to augment its robustness and efficiency.

References

[AG89] V. V. Alexandrov and N. D. Gorsky. Expert systems simulating human visual perception. International Journal of Pattern Recognition and Artificial Intelligence, 3(1):19-28, 1989.

[BHM+93] M. Bober, P. Hoad, J. Matas, P. Remagnino, J. Kittler, and J. Illingworth. Control of perception in an active vision system: Sensing and interpretation. In Workshop on Intelligent Robotic Systems 93, pages 258-276, Zakopane, Poland, July 1993.

[BK93] M. Bober and J. Kittler. A Hough transform based hierarchical algorithm for motion segmentation and estimation. In V. Cappellini, editor, Proceedings of the 4th International Workshop on Time-Varying Image Processing and Moving Object Recognition, Firenze, June 1993. Elsevier.

[EM88] Robert Engelmore and Tony Morgan. Blackboard Systems. The Insight Series of Artificial Intelligence. Addison-Wesley, 1988.

[GB90] S. Gong and H. Buxton. Control of purposive smart vision. Technical Report 561, University of London, July 1990.

[GR89] J. Giarratano and G. Riley. Expert Systems, Principles and Programming. PWS-KENT Series in Computer Science. PWS-KENT, Boston, 1989.

[HHW+89] B. Hayes-Roth, M. Hewett, R. Washington, R. Hewett, and A. Seiver. Distributing intelligence within an individual. In Les Gasser and Michael N. Huhns, editors, Distributed Artificial Intelligence, chapter 15, pages 385-412. Pitman, London, 1989.

[KYJ93] K. C. Wong, Y. Cheng, and J. Kittler. Recognition of polyhedral objects using triangle pair features. In IEE Proceedings Part I: Communications, Speech and Vision, volume 140, pages 72-85, February 1993.

[PJ93] P. Hoad and J. Illingworth. Recognition of 3D cylinders in 2D images by top-down model imposition. In Proceedings of the Scandinavian Conference on Image Analysis, volume II, pages 1137-1144, Tromso, Norway, May 25-28 1993.

[PJJJ93] P. Remagnino, J. Matas, J. Illingworth, and J. Kittler. A scene interpretation module for an active vision system. In SPIE Conference on Intelligent Robots and Computer Vision XII: Active Vision and 3D Methods, September 1993.

[Rem93] Paolo Remagnino. Control Issues in High Level Vision. PhD thesis, University of Surrey, Guildford, UK, July 1993.

[RJJJ] P. Remagnino, J. Illingworth, J. Kittler, and J. Matas. Intentional control of camera look direction and viewpoint in an active vision system. Springer-Verlag, to appear in Vision as Process, ESPRIT Basic Research Action series.

[VAP89] Vision as process. Technical annex, ESPRIT BRA 3038, European Commission, 1989.

Figure 2: Learning about a scene and moving objects. Left column (Learning about a scene): frames of the table-top sequence. Right column (Learning about objects): moving and recognised objects and the corresponding motion segmentations in Frames 3, 15 and 25.