From: AAAI Technical Report SS-02-04. Compilation copyright © 2002, AAAI (www.aaai.org). All rights reserved.

A Distributed Control Architecture for Guiding a Vision-based Mobile Robot

Pantelis Elinas, James J. Little
Computer Science Dept.
University of British Columbia
Vancouver, BC, Canada V6T 1Z4
{elinas, little}@cs.ubc.ca

Abstract

In this paper, we propose a human-robot interface based on speech and vision. This interface allows a person to interact with the robot as it explores. During exploration the robot learns about its surrounding environment, building a 2D occupancy grid map and a database of more complex visual landmarks and their locations. Such knowledge enables the robot to act intelligently. It is a difficult task for a robot to autonomously decide which are the most interesting regions to visit. Using the proposed interface, a human operator guides the robot during this exploration phase. Because of the robot's insufficient on-board computational power, we use a distributed architecture to realize our system.

Figure 1: Eric, the robot.

1 Introduction

We wish to implement robots that help people, especially the disabled and elderly, in performing daily tasks. It is naive to assume that someone is familiar with the use of computers or has the time and patience to learn how to use one. We want to build robots that are easy and intuitive to use. Our focus is therefore on developing an interface based on vision and speech, the way people engage in inter-personal communication.

The robot must perform robustly and intelligently in dynamic environments such as those designed and occupied by humans. It is often challenging to devise algorithms for complicated tasks such as speech recognition and people tracking, but it is even harder to make them run efficiently without exhausting the robot's computational resources. Since a robot is usually equipped with a single on-board computer, it does not have the computational power to run many complicated tasks. A robot control architecture is a system that allows a programmer to define the tasks, whether and how they exchange information, and which computer they run on.

There are many different robot control architectures [6, 15]. In recent years, behavior-based architectures [1, 3] have gained wide popularity for several reasons. One is the highly parallel nature of the architecture, which allows the workload to be distributed among many CPUs within the same workstation or spread over multiple workstations in local area networks. It also allows the modularization of the system, as each task can be developed, debugged and improved mostly independently of all others. These modules are called behaviors, and each is designed to solve a specific problem. Often, more than one behavior must be active at a time to solve complex problems that a single behavior cannot solve by itself.

Although our long-term goal is to build a general interface for robots that can achieve multiple tasks, in this paper we present a first system that allows a person to guide the robot during exploration.

This paper is structured as follows. Section 2 describes the hardware we use. Section 3 discusses the hierarchical organization of the behaviors and gives a brief description of each, with emphasis on the ones most directly involved in human-robot interaction. Section 4 explains how a person can get the robot's attention and direct it to a specific area to explore. In section 5 we give an example of a user directing the robot. We conclude in section 6.

2 The Robot

Our work is centered around a Real World Interfaces (RWI) B-14 mobile robot named Eric (see Figure 1). Eric is equipped with an Intel PIII-based computer running the Linux operating system. He senses the surrounding environment with his numerous on-board sensors. These include built-in sonar, infrared and tactile (bump) sensors.
Eric can see in 3D using a Triclops stereo vision system from Point Grey Research and in color with a SONY EVI color camera. Triclops is a real-time trinocular stereo vision system that we have used extensively in our lab for tasks such as robot navigation, mapping and localization [9, 10, 12].

Eric's on-board computer communicates with the rest of the workstations in our lab using a 10Mbps wireless Compaq modem.

The color camera is necessary for detecting human skin tone in images, as described in section 4.3, since the Triclops system uses gray scale cameras. The color and Triclops reference cameras are calibrated so that their images match. Currently we do the calibration manually, as we are in the process of upgrading to a color Digiclops stereo system that will eliminate the need for a separate color camera.

Lastly, we have added three microphones attached to the top of the robot's body. Two of the microphones are used for sound localization and the third is used for speech recognition. The data from each microphone are wirelessly transmitted to two workstations; the maximum range of the transmitters is 50 meters. One workstation is dedicated to sound localization, where the received signals are input to a SoundBlaster Live sound card. The second workstation is dedicated to speech recognition.

3 Robot Control Architecture

Eric is driven by a behavior-based control architecture. The entire system consists of a number of different programs working in parallel. Some are always active, producing output available across the system, while others only run when triggered.

Figure 2: Diagram of the robot control architecture.

The behaviors are hierarchically organized in low, middle and high levels (Figure 2). We have used the same architecture, with a small variation on the behavior set, to build an award-winning robotic waiter [4].

Behaviors that need direct access to the robot's hardware operate at the lowest level. These modules are responsible for controlling the robot's motors and collecting sensor data. Data are made available to other modules through a shared memory mechanism on the on-board computer. This forces all low-level behaviors to run on the local computer.

The middle-level behaviors fall into two groups: mobility and human-robot interaction. The first group, mobility, consists of those behaviors that control Eric's safe navigation in a dynamic environment. There are behaviors for 2D occupancy grid mapping using stereo information, navigation, localization and exploration. The second group, human-robot interaction, consists of all the behaviors needed for Eric to communicate with people. These include speech recognition and synthesis, user head/gesture tracking, sound localization and user finding. Inputs to the middle-level behaviors are the outputs of other behaviors from all levels of the hierarchy. Depending on the input requirements of each middle-level behavior, i.e., whether it needs access to the sensor data, it runs either on-board the robot or on a remote workstation.

At the highest level of this hierarchy exists the supervising behavior (also referred to as the supervisor). It performs high-level planning. It satisfies Eric's high-level goals by coordinating the passing of information among the middle-level behaviors and also by selectively activating/deactivating the middle-level behaviors as needed. It communicates with all other behaviors using UNIX sockets.

Research of others has focused on the coordination of behaviors and on extending a robot's abilities by dynamically and autonomously adding new behaviors. Our system is an exercise in developing the behavior set needed for natural human-robot interaction, and so we have not added the ability for the supervisor to learn with experience. The supervisor's structure and abilities are pre-programmed and hence fixed. The supervising behavior's finite state diagram is shown in Figure 3.

Figure 3: The finite state diagram for the supervising behavior.
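To make the supervisor/behavior coupling concrete, the following Python sketch shows one possible realization of this scheme: a behavior process exposing its output over a UNIX-domain socket and a supervisor-side query function. The socket path, the command names ("activate", "deactivate", "query") and the line-oriented text protocol are illustrative assumptions; the actual wire format used on Eric is not described in this paper.

import socket, os

SOCKET_PATH = "/tmp/sound_localization.sock"   # hypothetical path, one per behavior

class BehaviorServer:
    """A behavior process: accepts 'activate'/'deactivate' commands from the
    supervisor and answers 'query' with its latest output string."""
    def __init__(self, path=SOCKET_PATH):
        if os.path.exists(path):
            os.unlink(path)
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.bind(path)
        self.sock.listen(1)
        self.active = False
        self.latest_output = "none"   # e.g. a sound direction, updated elsewhere

    def serve_forever(self):
        while True:
            conn, _ = self.sock.accept()
            with conn:
                cmd = conn.recv(1024).decode().strip()
                if cmd == "activate":
                    self.active = True
                    conn.sendall(b"ok\n")
                elif cmd == "deactivate":
                    self.active = False
                    conn.sendall(b"ok\n")
                elif cmd == "query":
                    reply = self.latest_output if self.active else "inactive"
                    conn.sendall((reply + "\n").encode())

def supervisor_query(path=SOCKET_PATH, command="query"):
    """Supervisor side: send one command to a behavior and return its reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(path)
        s.sendall((command + "\n").encode())
        return s.recv(1024).decode().strip()

Because each behavior is a separate program reachable through a socket, it can run on-board or on any workstation without changing the supervisor's logic, which is the point of the distributed design.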
4 Human-Robot Interaction

For Eric to function in a dynamic environment, he must first gather information about the environment's static features. We place Eric in this space and let him explore using his exploration behavior. Eric decides which areas to visit first using an algorithm that evaluates all possible locations based on the available resources, e.g., battery power remaining, while trying to maximize information gain. These are constraints that are difficult to satisfy, and so we provide an interface for a person to interrupt the robot and guide it to the best location to explore.

When Eric hears his name, he responds by moving out of exploration mode and turning toward the caller. Eric turns to look at the person and applies his vision algorithm for locating people in images. Once Eric successfully locates the person, she can optionally point to a location where she wants Eric to explore. Eric navigates to that location and resumes exploration. Eric locates and tracks the person that he is communicating with. There are three middle-level behaviors responsible for these tasks. We briefly describe each of these behaviors below, and then give an example of the system at work.

4.1 Speech Recognition

We use IBM's ViaVoice speech recognition engine for interpreting speech heard by Eric. ViaVoice is programmable to operate with either a command-and-control or a dictation grammar. We use the former. We have specified a simple grammar that allows about 25 sentences to be spoken.

We programmed the speech recognition behavior using the Java Speech API (JSAPI) under the Microsoft Windows operating system. This behavior outputs a unique integer for each sentence in the grammar when that sentence is heard. The input is the voice signal received over a wireless microphone. For efficiency we have dedicated a workstation just for speech recognition.
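The paper does not list the grammar sentences or their integer codes; the short sketch below only illustrates the output convention described above, using invented sentences and IDs in place of the actual 25-sentence grammar.

# Sketch of the speech behavior's output convention: each grammar sentence
# maps to a unique integer command code consumed by the supervisor.
# Sentences and codes are invented for illustration.
COMMAND_CODES = {
    "eric": 1,        # attention word: interrupts exploration
    "find me": 2,     # triggers the people-finding behavior
    "go there": 3,    # triggers gesture recognition and navigation
    "yes": 4,
    "no": 5,
    "explore": 6,     # return to exploration mode
}

def recognize(sentence: str) -> int:
    """Return the integer code for a recognized sentence, or -1 if the
    utterance is not in the grammar (no output is produced in that case)."""
    return COMMAND_CODES.get(sentence.strip().lower(), -1)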
4.2 Sound Localization

Eric gets an approximate idea of where a person talking to him is by localizing on the person's voice. Sound localization is the process of determining the location of a sound source using two or more microphones separated in space. We use two microphones at a distance of 20cm, mounted on top of the robot. The signal captured by each of them is input to a sound card and both are processed to compute the direction to the sound source. With two microphones we can only determine the direction on the plane parallel to the floor. The sound localization algorithm we use is described in [11]. The geometry for the problem in 2D is shown in Figure 4.

Figure 4: The geometry for the sound localization problem in 2D using two microphones. The two microphones are m1 and m2, separated by distance b; a is the angle of incidence and r1, r2 are the distances of each microphone to the sound source.

Since our robot operates in a 3D world, using the simplified 2D case for sound localization might at first appear inadequate; however, our experiments have shown that sound localization in this simple form performs adequately for the task at hand.

The algorithm computes the time delay, n, between the signals captured by the two microphones. The direction to the sound source as a function of this delay is given by:

    sin(a) = c n / (f b)                                                  (1)

where c is the speed of sound, n is the time delay in samples, f is the sampling frequency and b is the distance between the microphones. The range of discernible angles goes from a = 0 (for n = 0) to a = 90 (for n = 14). The situation is symmetric for a user to the left or right of the robot. Our setup allows for a total of 27 angles in the range [-90, 90] degrees.

The localization error can be computed by differentiating Equation 1. It is given by:

    da = c / (f b cos(a)) dn = c / (f b sqrt(1 - (c n / (f b))^2)) dn     (2)

Figure 5 shows a plot of Equation 2. One notices that the localization error increases rapidly for angles larger than 70 degrees (n = 10).

Figure 5: The sound localization error function given by Equation 2.

The output of this behavior is the direction to the sound source. It localizes on all sounds with power above a certain environmental noise threshold determined during a calibration phase at startup. The supervisor monitors the output of the sound localization behavior, and it only interrupts the robot (stops exploration) when there is a valid sound direction and the speech recognition behavior returns the name of the robot.
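As a concrete illustration of Equations 1 and 2, the sketch below converts a measured sample delay into a bearing and an error bound. Only the 20cm microphone baseline comes from the text; the speed of sound and the sampling frequency are assumed values not stated in the paper.

import math

SPEED_OF_SOUND = 343.0    # m/s, assumed room-temperature value
SAMPLING_FREQ  = 22050.0  # Hz, assumed; not stated in the paper
BASELINE       = 0.20     # m, distance between the two microphones (from the text)

def bearing_from_delay(n_samples: float) -> float:
    """Equation 1: sin(a) = c*n/(f*b). Returns the bearing in degrees,
    clamped to +/-90 when the delay exceeds the physically possible maximum."""
    s = SPEED_OF_SOUND * n_samples / (SAMPLING_FREQ * BASELINE)
    s = max(-1.0, min(1.0, s))          # clamp numerical overshoot
    return math.degrees(math.asin(s))

def bearing_error(n_samples: float, dn: float = 1.0) -> float:
    """Equation 2: bearing error (degrees) for a delay error of dn samples;
    it grows rapidly as the bearing approaches +/-90 degrees."""
    s = SPEED_OF_SOUND * n_samples / (SAMPLING_FREQ * BASELINE)
    if abs(s) >= 1.0:
        return float("inf")
    da = SPEED_OF_SOUND * dn / (SAMPLING_FREQ * BASELINE * math.sqrt(1.0 - s * s))
    return math.degrees(da)

# Example: the bearing for a 10-sample delay and its +/-1-sample error bound.
print(bearing_from_delay(10), bearing_error(10))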
4.3 People Finding

When Eric hears his name and has an estimate of the direction his name was spoken from, he turns to look at the user. He must then locate the user precisely in the images captured with his two cameras.

We start with the estimate of the user's head location given by the sound localization results. If sound localization had no error, then the user would always be found in the center of the images. Since this is not the case, we use our estimate of the localization error to define a region around the center of the image where it is most likely to find the user. This region grows as the localization error grows, reaching its maximum at angles larger than 60 degrees.

We then proceed to identify candidate face regions in the image. We segment the color image searching for skin tone [5, 13, 14]. We transform the color image from the Red-Green-Blue (RGB) to the Hue-Saturation-Value (HSV) color space. During an offline calibration stage we determine the hue of human skin by selecting skin colored pixels for several different people. Figures 7 (a) and (b) show the original image and the skin pixels detected. We then apply a 3x3 median filter to remove noise pixels and fill in the regions of the face that are not skin colored, such as the eyes and the mouth. Finally, we perform connected component analysis [7] on the binary image resulting from the skin color segmentation.

Once all the connected components have been identified, we compute a score for each component being a user's face. The component with the highest score is selected as the user. This score is a function of three variables: a face template match cost, an expected head size functional and a penalty term on the expected distance of the person to the robot. Finally, the entire score is multiplied by the value of a linear function whose slope depends on our confidence in the sound localization results. Equation 3 expresses this more formally:

    s(c_i) = F_i × (w_1 × S_a + w_2 × S_t + w_3 × D_z(i))                 (3)

where c_i is the i-th component, w_j is the weight of the j-th attribute, F_i is a linear function determined by the localization error and the location of the component in the image, S_a is the size attribute, S_t is the matching cost between the component's gray scale image and a template of the user's face, and D_z(i) is a penalty term for components that are too close to or too far from the camera. The three weights are determined beforehand experimentally to yield the best results.

S_t is the standard normalized cross-correlation cost computed using the image and templates of the user's face. The user provides a full frontal view of his/her face and images are captured for use as templates. We center and scale the template to the component's location and size respectively and then compute:

    S_t = Σ_{k=1..m} Σ_{l=1..n} g[k,l] f[k,l] / Σ_{k=1..m} Σ_{l=1..n} f^2[k,l]   (4)

where g denotes the m×n face template and f denotes the m×n face region in the gray scale image.

The size term, S_a, is a function of the size of the component and the expected size at the component's distance from the robot. It is given by:

    S_a = Size_expected / Size_actual,   if Size_actual > Size_expected
        = Size_actual / Size_expected,   otherwise                         (5)

where Size_actual is the area of the component in pixels and Size_expected is determined using a model of a person's head at the component's distance. The model is shown in Figure 6.

Figure 6: A simple model of a human including the dimensions of the head, shoulders, arms and overall height.

The last term we refer to as the distance penalty, and we define it as:

    D_z(i) = 0,       if 3.0m < z < 5.0m
           = α × z,   otherwise                                            (6)

where α is set to -0.5. So, we penalize a component the further away from the camera it is.

Figure 7 shows an example of finding the user. Notice that the user need not be standing. Our approach can distinguish between the face and the two hand components even though they have similar sizes.

Figure 7: Example of finding the user in images. (a) The color image, (b) the skin regions, (c) the connected components and (d) the Triclops image annotated with the location of the user's head shown with a square.
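To make the scoring concrete, the sketch below combines Equations 3-6 for a single candidate component. The weights and the confidence factor are placeholders: the paper states the weights were tuned experimentally but does not report their values, and the exact form of F_i is not given.

# Placeholder weights; the actual tuned values are not reported in the paper.
W_SIZE, W_TEMPLATE, W_DISTANCE = 1.0, 1.0, 1.0
ALPHA = -0.5   # distance-penalty slope from Equation 6

def template_cost(face_region, template):
    """Equation 4: correlation of the face template g against the candidate
    face region f (both m x n lists of gray values)."""
    num = sum(g * f for grow, frow in zip(template, face_region)
                    for g, f in zip(grow, frow))
    den = sum(f * f for frow in face_region for f in frow)
    return num / den if den else 0.0

def size_score(size_actual, size_expected):
    """Equation 5: ratio of actual to expected head area, always <= 1."""
    if size_actual > size_expected:
        return size_expected / size_actual
    return size_actual / size_expected

def distance_penalty(z):
    """Equation 6: no penalty in the preferred 3-5m band, otherwise a
    penalty proportional to the distance z (meters)."""
    return 0.0 if 3.0 < z < 5.0 else ALPHA * z

def component_score(face_region, template, size_actual, size_expected, z,
                    confidence=1.0):
    """Equation 3: weighted sum of the three attributes, scaled by a
    confidence factor standing in for F_i."""
    return confidence * (W_SIZE * size_score(size_actual, size_expected)
                         + W_TEMPLATE * template_cost(face_region, template)
                         + W_DISTANCE * distance_penalty(z))

The component with the highest score over all connected components is reported as the user's face.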
4.4 User Head/Gesture Tracking

Once we identify the user's head location in the image, we keep this location current by tracking the head over time. We also need to be able to tell where the user is pointing for Eric to move there. In the following two sections, we describe how we solve these two problems.

4.4.1 Head Tracking

We employ a head tracker very similar to the one presented by S. Birchfield in [2]. Birchfield's is a contour-based tracker with a simple prediction scheme assuming constant velocity. The contour tracked is the ellipse a person's head traces in the edge image [8].

We make one addition to the algorithm to increase robustness, taking advantage of the stereo information available at no significant computational cost. We begin by selecting the skin colored pixels in a small region around each hypothesized head location. Then, we compute the mean depth of these pixels and use this value to depth-segment the image using our stereo data. This allows us to robustly remove the background pixels. The tracking algorithm can run in real time (30fps), but we limit it to 5fps for computational reasons, especially since our experiments reveal that this is sufficiently fast.

4.4.2 Gesture Recognition

The user points to a location on the floor and instructs Eric to move there. Eric computes this location and reports the result, waiting for the user's confirmation.

The process is guided by the already known location of the user's head, taking advantage of color and stereo information along with our model of the body of a human (see Figure 6). We proceed as follows (a code sketch of the last two steps is given at the end of this subsection):

1. Get the gray, color and stereo images, denoted I_g, I_c and I_s respectively. These are the input data, along with the last known location of the user's head, l_head.
2. Select the skin pixels in the neighborhood of l_head, generating the set of skin pixels N.
3. Compute the average depth, d_N, of the pixels in N.
4. Segment the color and gray images using d_N as a guide, Figure 8 (a).
5. Select the skin pixels in the segmented image, forming the set of foreground pixels M.
6. Compute the connected components in M.
7. Select the component closest to the last known head location as the face.
8. Classify the remaining components as being to the left or right of the face component. Construct the sets L, R respectively.
9. Remove the components from L, R that are too far from or too close to the face component, using the model in Figure 6 as a guide.
10. From the components remaining, select the dominant direction using a heuristic based on the angle that each arm makes with respect to the person's body. Define a vector from the shoulder to the hand. This is the direction, f, the user is pointing, Figure 8 (b).
11. Compute the point on the floor using direction f and return.

If Eric fails to detect any hands, most often because of failure during the skin detection, he reports this to the user.

Figure 8: Example of gesture tracking showing two of the steps in the algorithm, (a) the selected foreground pixels in the gray scale image and (b) the final result showing the direction selected, and the location of the user's head and shoulders.
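The paper does not spell out the geometry used in steps 10 and 11; the sketch below is one plausible formulation, assuming the shoulder and hand positions are available as 3D points in a robot-centered frame (x forward, y left, z up, in meters) recovered from the stereo data. It simply extends the shoulder-to-hand ray until it meets the floor plane z = 0.

def pointing_target(shoulder, hand):
    """Steps 10-11 under the stated assumptions: intersect the shoulder-to-hand
    ray with the floor plane z = 0. Returns the floor point (x, y, 0) or None
    if the arm does not point downward."""
    sx, sy, sz = shoulder
    hx, hy, hz = hand
    dx, dy, dz = hx - sx, hy - sy, hz - sz   # pointing direction f
    if dz >= 0:          # arm level or pointing upward: no floor intersection
        return None
    t = -hz / dz         # extend the ray from the hand until z reaches 0
    return (hx + t * dx, hy + t * dy, 0.0)

# Example: shoulder at 1.4m height, hand at 1.0m, half a meter further out.
print(pointing_target((4.0, 0.0, 1.4), (4.3, 0.3, 1.0)))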
5 Example of HRI

Eric can only understand a small number of spoken sentences. He also has a limited vocabulary, i.e., he can only speak a limited number of sentences. We have kept the number of sentences that Eric can recognize and speak small in order to speed up speech processing and also to keep his interaction with people a straightforward process. The dialog mostly consists of the user giving commands to Eric and Eric responding by confirming the heard command and/or by reporting the results of a computation. In general, two-way communication may proceed as follows:

Person: (loudly) Eric!
Eric:   (stops motion and turns towards the person)
Person: Find me!
Eric:   I can see you. You are 4 meters away and 1.6 meters high. Is this correct?
Person: Yes!
Eric:   Confirmed!
Person: (Points with finger) Go there!
Eric:   You are pointing 2 meters ahead and 4 meters to my left. Do you want me to go there?
Person: Yes!
Eric:   I am on my way! (Eric moves to that location and enters exploration mode)

This is only one instance of the possible dialogs a user can have with Eric. At any time during the interaction, the user can simply instruct the robot to return to exploration. Additionally, the user may request information about Eric's mechanical status. Eric might also fail in one of the tasks of finding and tracking the user. In that case, Eric reports this and waits for new instructions.

6 Conclusions

In this paper we presented a distributed system for natural human-robot interaction. It is designed to perform in real time by distributing the workload among several workstations connected over a local area network. People can communicate with the robot using speech and hand gestures to direct it to specific locations to explore. We are currently working on improving the interaction process by allowing more sentences to be understood and spoken by the robot. We are also adding to the number of gestures that can be understood and, finally, software for modeling the user's emotional state by analyzing his facial expressions.

Acknowledgments

Special thanks to Jesse Hoey for reviewing this paper and Don Murray for his help with programming the robot.

References

[1] R. C. Arkin. Behavior-Based Robotics. Intelligent Robots and Autonomous Agents. The MIT Press, 1998.
[2] S. Birchfield. An elliptical head tracker. In 31st Asilomar Conference on Signals, Systems and Computers, pages 1710-1714, November 1997.
[3] R. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2(1):14-23, 1986.
[4] P. Elinas, J. Hoey, D. Lahey, D. J. Montgomery, D. Murray, S. Se, and J. J. Little. Waiting with José, a vision-based mobile robot. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA'02), Washington D.C., May 2002.
[5] B. Funt, K. Barnard, and L. Martin. Is machine colour constancy good enough? In 5th European Conference on Computer Vision, pages 445-459, 1998.
[6] B. Hayes-Roth. A blackboard architecture for control. Artificial Intelligence, 26:251-321, 1985.
[7] R. Jain, R. Kasturi, and B. Schunck. Machine Vision. MIT Press and McGraw-Hill, Inc., 1995.
[8] D. Marr and E. C. Hildreth. Theory of edge detection. Proc. R. Soc. Lond. B, 207:187-217, 1980.
[9] D. Murray and C. Jennings. Stereo vision based mapping and navigation for mobile robots. In Proc. ICRA, Albuquerque, NM, 1997.
[10] D. Murray and J. Little. Using real-time stereo vision for mobile robot navigation. In Proceedings of the IEEE Workshop on Perception for Mobile Agents, Santa Barbara, CA, June 1998.
[11] G. L. Reid and E. Milios. Active stereo sound localization. Technical Report CS-1999-09, York University, 4700 Keele Street, North York, Ontario M3J 1P3, Canada, December 1999.
[12] S. Se, D. Lowe, and J. Little. Vision-based mobile robot localization and mapping using scale-invariant features. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 2051-2058, Seoul, Korea, May 2001.
[13] K. Sobottka and I. Pitas. Segmentation and tracking of faces in color images. In Proc. of the Second International Conference on Automatic Face and Gesture Recognition, pages 236-241, 1996.
[14] J.-C. Terrillon and S. Akamatsu. Comparative performance of different chrominance spaces for color segmentation and detection of human faces in complex scene images. In Vision Interface, pages 1821-1828, 1999.
[15] J. K. Tsotsos. Intelligent control for perceptually attentive agents: The S* proposal. Robotics and Autonomous Systems, 21(1):5-21, January 1997.