
From: AAAI Technical Report SS-02-04. Compilation copyright © 2002, AAAI (www.aaai.org). All rights reserved.
A Distributed Control Architecture for Guiding a Vision-based Mobile Robot

Pantelis Elinas, James J. Little
Computer Science Dept.
University of British Columbia
Vancouver, BC, Canada V6T 1Z4
(elinas, little)@cs.ubc.ca
Abstract
In this paper, we propose a human-robot interface based on speech and vision. This interface allows a person to interact with the robot as it explores. During exploration the robot learns about its surrounding environment, building a 2D occupancy grid map and a database of more complex visual landmarks and their locations. Such knowledge enables the robot to act intelligently. It is a difficult task for a robot to autonomously decide what are the most interesting regions to visit. Using the proposed interface, a human operator guides the robot during this exploration phase. Because of the robot's insufficient on-board computational power, we use a distributed architecture to realize our system.
Figure 1: Eric, the robot.
1 Introduction

We wish to implement robots to help people, especially the disabled and elderly, in performing daily tasks. It is naive to assume that someone is familiar with the use of computers or has the time and patience to learn how to use one. We want to build robots that are easy and intuitive to use. So our focus is on developing an interface based on vision and speech, the way people engage in inter-personal communication.

The robot must perform robustly and intelligently in dynamic environments such as those designed and occupied by humans. It is often challenging to devise algorithms that achieve complicated tasks such as speech recognition and people tracking, but it is even harder to make them run efficiently without exhausting the robot's computational resources. Since a robot is usually equipped with a single on-board computer, it does not have the computational power to run many complicated tasks. A robot control architecture is a system that allows a programmer to define the tasks, whether and how they exchange information, and which computer they run on.

There are many different robot control architectures, [6, 15]. In recent years, behavior-based architectures, [1, 3], have gained wide popularity for several reasons. One is the highly parallel nature of the architecture, which allows the workload to be distributed among many CPUs within the same workstation or spread over multiple workstations in local area networks. It also allows the modularization of the system, as each task can be developed, debugged and improved mostly independently of all others. These modules are called behaviors, and each is designed to solve a specific problem. Often, more than one behavior must be active at a time to solve complex problems that a single behavior cannot solve by itself.

Although our long-term goal is to build a general interface for robots that can achieve multiple tasks, in this paper we present a first system that allows a person to guide the robot during exploration.

This paper is structured as follows. Section 2 describes the hardware we use. Section 3 discusses the hierarchical organization of the behaviors and gives a brief description of each, with emphasis on the ones most directly involved in human-robot interaction. Section 4 explains how a person can get the robot's attention and direct it to a specific area to explore. In Section 5 we give an example of a user directing the robot. We conclude in Section 6.

2 The Robot

Our work is centered around a Real World Interfaces (RWI) B-14 mobile robot named Eric (see Figure 1). He is equipped with an Intel PIII-based computer running the Linux operating system. He senses the surrounding environment with his numerous on-board sensors. These include built-in sonar, infrared and tactile (bump) sensors.

Eric can see in 3D using a Triclops stereo vision system from Point Grey Research and in color with a SONY EVI camera. Triclops is a real-time trinocular stereo vision system that we have used extensively in our lab for tasks such as robot navigation, mapping and localization, [9, 10, 12].
Figure 2: Diagram of the robot control architecture.

Figure 3: The finite state diagram for the supervising behavior.
Eric's on-board computer communicates with the rest of the workstations in our lab using a 10 Mbps wireless Compaq modem.

The color camera is necessary for detecting human skin tone in images, as described in Section 4.3, since the Triclops system uses gray-scale cameras. The color and Triclops reference cameras are calibrated so that their images match. Currently we do the calibration manually, as we are in the process of upgrading to a color Digiclops stereo system that will eliminate the need for a separate color camera.

Lastly, we have added three microphones attached to the top of the robot's body. Two of the microphones are used for sound localization and the third is used for speech recognition. The data from each microphone are wirelessly transmitted to two workstations. The maximum range of the transmitters is 50 meters. One workstation is dedicated to sound localization, where the received signal is input to a Sound Blaster Live sound card. The second workstation is dedicated to speech recognition.

3 Robot Control Architecture

Eric is driven by a behavior-based control architecture. The entire system consists of a number of different programs working in parallel. Some are always active, producing output available across the system, while others only run when triggered. The behaviors are hierarchically organized in low, middle and high levels (Figure 2). We have used the same architecture, with a small variation on the behavior set, to build an award-winning robotic waiter, [4].

Behaviors that need direct access to the robot's hardware operate at the lowest level. These modules are responsible for controlling the robot's motors and collecting sensor data. Data are made available to other modules through a shared memory mechanism on the on-board computer. This forces all low-level behaviors to run on the local computer.

The middle-level behaviors fall into two groups: mobility and human-robot interaction. The first group, mobility, consists of those behaviors that control Eric's safe navigation in a dynamic environment. There are behaviors for 2D occupancy grid mapping using stereo information, navigation, localization and exploration. The second group, human-robot interaction, consists of all the behaviors needed for Eric to communicate with people. These include speech recognition and synthesis, user head/gesture tracking, sound localization and user finding. Inputs to the middle-level behaviors are the outputs of other behaviors from all levels of the hierarchy. Depending on the input requirements of each middle-level behavior, i.e., whether it needs access to the sensor data, it runs either on-board the robot or on a remote workstation.

At the highest level of this hierarchy exists the supervising behavior (also referred to as the supervisor). It performs high-level planning. It satisfies Eric's high-level goals by coordinating the passing of information among the middle-level behaviors and also by selectively activating/deactivating the middle-level behaviors as needed. It communicates with all other behaviors using UNIX sockets.

The research of others has focused on the coordination of behaviors and on extending a robot's abilities by dynamically and autonomously adding new behaviors. Our system is an exercise in developing the behavior set needed for natural human-robot interaction, so we have not added the ability for the supervisor to learn with experience. The supervisor's structure and abilities are pre-programmed and hence fixed. The supervising behavior's finite state diagram is shown in Figure 3.
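The paper does not detail the message format used between the supervisor and the behaviors; the following minimal Python sketch only illustrates the general pattern of behaviors reporting their outputs to a supervising process over a UNIX domain socket. The socket path and the newline-delimited "name key=value" messages are assumptions for illustration, not the authors' actual protocol.

# Minimal sketch (not the paper's implementation) of behaviors reporting
# their outputs to the supervisor over a UNIX domain socket.
# The socket path and the "name key=value" message format are assumptions.
import os
import socket
import threading
import time

SOCKET_PATH = "/tmp/supervisor.sock"  # hypothetical path

def supervisor():
    """Accept connections from behaviors and print each report."""
    if os.path.exists(SOCKET_PATH):
        os.unlink(SOCKET_PATH)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(SOCKET_PATH)
    srv.listen(5)
    while True:
        conn, _ = srv.accept()
        with conn:
            for line in conn.makefile():
                behavior, _, payload = line.strip().partition(" ")
                print(f"supervisor: {behavior} reported {payload}")
                # A real supervisor would update its finite state machine
                # here and activate/deactivate behaviors accordingly.

def report(behavior_name, payload):
    """Used by a behavior process to send one output message."""
    cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    cli.connect(SOCKET_PATH)
    cli.sendall(f"{behavior_name} {payload}\n".encode())
    cli.close()

if __name__ == "__main__":
    threading.Thread(target=supervisor, daemon=True).start()
    time.sleep(0.2)                       # give the server time to bind
    report("sound_localization", "direction_deg=35")
    report("speech_recognition", "sentence_id=7")
    time.sleep(0.2)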
4 Human-Robot Interaction

For Eric to function in a dynamic environment, he must first gather information about the environment's static features. We place Eric in this space and let him explore using his exploration behavior. Eric decides which areas to visit first using an algorithm that evaluates all possible locations based on the available resources, i.e., battery power remaining, while trying to maximize information gain.
Figure 4: The geometry for the sound localization problem in 2D using two microphones. The two microphones m1 and m2 are separated by distance b, α is the angle of incidence, and r1, r2 are the distances of each microphone to the sound source.
These are constraints that are difficult to satisfy, so we provide an interface for a person to interrupt the robot and guide it to the best location to explore.
When Eric hears his name, he responds by moving out of exploration mode and turning toward the caller. Eric turns to look at the person and applies his vision algorithm for locating people in images. Once Eric successfully locates the person, she can optionally point to a location where she wants Eric to explore. Eric navigates to that location and resumes exploration.
Eric locates and tracks a person that he is communicating with. There are three middle-level behaviors responsible for these tasks. We continue by briefly describing each of these behaviors. Then we give an example of the system at work.
4.1 Speech Recognition
We use IBM's ViaVoice speech recognition engine for interpreting speech heard by Eric. ViaVoice is programmable to operate with either a command-and-control or a dictation grammar. We use the former. We have specified a simple grammar that allows about 25 sentences to be spoken.

We programmed the speech recognition behavior using the Java Speech API (JSAPI) under the Microsoft Windows operating system. This behavior outputs a unique integer for each sentence in the grammar, when that sentence is heard. The input is the voice signal received over a wireless microphone. For efficiency we have dedicated a workstation just for speech recognition.
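The behavior itself is built on ViaVoice through JSAPI; the Python sketch below does not reproduce that integration and only illustrates the stated output convention, a unique integer per grammar sentence. The sentences listed are hypothetical examples, not the actual 25-sentence grammar.

# Sketch of the speech-recognition behavior's output convention only:
# each grammar sentence maps to a unique integer ID. The actual system
# obtains the recognized sentence from ViaVoice via JSAPI; here the
# recognized text is simply passed in, and the sentences are invented.
GRAMMAR = [
    "eric",                    # attention word
    "find me",
    "go there",
    "yes",
    "no",
    "return to exploration",
]
SENTENCE_IDS = {sentence: i for i, sentence in enumerate(GRAMMAR)}

def sentence_to_id(recognized_text: str) -> int:
    """Return the unique integer for a recognized sentence, or -1."""
    return SENTENCE_IDS.get(recognized_text.strip().lower(), -1)

if __name__ == "__main__":
    print(sentence_to_id("Go there"))   # -> 2
    print(sentence_to_id("dance"))      # -> -1 (not in the grammar)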
4.2 Sound Localization
Eric gets an approximate idea of where a person talking to him is by localizing on the person's voice. Sound localization is the process of determining the location of a sound source using two or more microphones separated in space. We use two microphones at a distance of 20 cm. The microphones are mounted on top of the robot. The signal captured by each of them is input to a sound card and both are processed to compute the direction to the sound source. With two microphones we can only determine the direction on the plane parallel to the floor. The sound localization algorithm we use is described in [11]. The geometry for the problem in 2D is shown in Figure 4.
Figure 5: The sound localization error function given by Equation 2.
Since our robot operates in a 3D world, using the simplified 2D case for sound localization might at first appear inadequate; however, our experiments have shown that sound localization in this simple form performs adequately for the task at hand. The algorithm computes the time delay, n, between the signals captured by the two microphones. The direction to the sound source as a function of this delay is given by:

\sin(\alpha) = \frac{cn}{fb}    (1)

where c is the speed of sound, n is the time delay in samples, f is the sampling frequency and b is the distance between the microphones. The discernible angles range from α = 0 (for n = 0) to α = 90 degrees (for n = 14). The situation is symmetric for a user to the left or right of the robot. Our setup allows for a total of 27 angles in the range [-90, 90] degrees.
The localization error can be computed by differentiating Equation 1. It is given by:

\delta\alpha = \frac{c\,\delta n}{fb\sqrt{1 - \left(\frac{cn}{fb}\right)^{2}}}    (2)
Figure 5 shows a plot of Equation 2. One notices that the localization error increases rapidly for angles larger than 70 degrees (n = 10).
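A small numerical sketch of Equations 1 and 2 follows; b = 20 cm comes from the paper, while the speed of sound c and the sampling frequency f are assumed values that the paper does not give explicitly.

# Numerical sketch of Equations 1 and 2. b = 0.20 m comes from the paper;
# the speed of sound c and the sampling frequency f are assumed values.
import math

C = 343.0      # speed of sound in m/s (assumed)
F = 22050.0    # sampling frequency in Hz (assumed)
B = 0.20       # microphone separation in meters (from the paper)

def direction_from_delay(n: int) -> float:
    """Equation 1: angle of incidence (degrees) from a delay of n samples."""
    s = C * n / (F * B)
    s = max(-1.0, min(1.0, s))          # clamp against rounding
    return math.degrees(math.asin(s))

def localization_error(n: int, delta_n: float = 1.0) -> float:
    """Equation 2: error (degrees) caused by a delay error of delta_n samples."""
    s = C * n / (F * B)
    return math.degrees(C * delta_n / (F * B * math.sqrt(1.0 - s * s)))

if __name__ == "__main__":
    for n in (0, 5, 10, 12):
        print(n, round(direction_from_delay(n), 1), round(localization_error(n), 2))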
The output of this behavior is the direction to the sound source. It localizes on all sounds with power above a certain environmental noise threshold determined during a calibration phase at start-up. The supervisor monitors the output of the sound localization behavior and only interrupts the robot (stops exploration) when there is a valid sound direction and the speech recognition behavior returns the name of the robot.
4.3 People Finding
When Eric hears his name and has an estimate of the direction his name was spoken from, he turns to look at the user. He must then locate the user precisely in the images captured with his two cameras.

We start with the estimate of the user's head location guided by the sound localization results. If sound localization had no error, the user would always be found in the center of the images. Since this is not the case, we use our estimate of the localization error to define a region around the center of the image where it is most likely to find the user. This region grows as the localization error grows and it reaches its maximum at angles larger than 60 degrees.

We then proceed to identify candidate face regions in the image. We segment the color image searching for skin tone, [5, 13, 14]. We transform the color image from the Red-Green-Blue (RGB) to the Hue-Saturation-Value (HSV) color space. During an offline calibration stage we determine the hue of human skin by selecting skin-colored pixels for several different people. Figure 7 (a) and (b) show the original image and the skin pixels detected. We then apply a 3x3 median filter to remove noise pixels and fill in the regions of the face that are not skin color, such as the eyes and the mouth. Finally, we perform connected component analysis, [7], on the binary image resulting from the skin color segmentation.
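A sketch of this segmentation step, written with OpenCV, is given below; the HSV thresholds are placeholders, since the paper determines the skin hue in an offline calibration stage rather than from fixed numbers.

# Sketch of the candidate-face detection step: HSV skin segmentation,
# 3x3 median filtering and connected-component analysis. The hue/saturation
# thresholds below are placeholders; the paper calibrates the skin hue
# offline, not with these fixed numbers.
import cv2
import numpy as np

def candidate_face_components(bgr_image: np.ndarray):
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Assumed skin range: low hue, moderate saturation, reasonably bright.
    skin = cv2.inRange(hsv, (0, 40, 60), (25, 180, 255))
    skin = cv2.medianBlur(skin, 3)               # 3x3 median filter
    count, labels, stats, centroids = cv2.connectedComponentsWithStats(skin)
    # Skip label 0 (background); return (area, centroid) per component.
    return [(stats[i, cv2.CC_STAT_AREA], tuple(centroids[i]))
            for i in range(1, count)]

if __name__ == "__main__":
    img = cv2.imread("frame.png")                # any color frame
    if img is not None:
        for area, (cx, cy) in candidate_face_components(img):
            print(f"component at ({cx:.0f},{cy:.0f}), area {area} px")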
Once all the connected components have been identified, we compute a score for each component being the user's face. The component with the highest score is selected as the user. This score is a function of three variables: a face template match cost, an expected head size functional, and a penalty term on the expected distance of the person to the robot. Finally, the entire score is multiplied by the value of a linear function whose slope depends on our confidence in the sound localization results. Equation 3, below, expresses this more formally:

S(c_i) = F_i \times (w_1 S_a + w_2 S_t + w_3 D_z(i))    (3)

where c_i is the i-th component, w_j is the weight of the j-th attribute, F_i is a linear function determined by the localization error and the location of the component in the image, S_a is the size attribute, S_t is the matching cost between the component's gray-scale image and a template of the user's face, and D_z(i) is a penalty term for components that are too close to or too far from the camera. The three weights are determined beforehand experimentally to yield the best results.

S_t is the standard normalized cross-correlation cost computed using the image and templates of the user's face. The user provides a full frontal view of his/her face and images are captured for use as templates. We center and scale the template to the component's location and size respectively and then we compute:

S_t = \frac{\sum_{k=1}^{m}\sum_{l=1}^{n} g[k,l]\, f[k,l]}{\sum_{k=1}^{m}\sum_{l=1}^{n} f^{2}[k,l]}    (4)

where g denotes the m × n face template and f denotes the m × n face region in the gray-scale image.
The second term, S_a, is a function of the size of the component and the expected size at the component's distance from the robot. It is given by:

S_a = \begin{cases} \frac{\mathrm{Size}_{expected}}{\mathrm{Size}_{actual}} & \text{if } \mathrm{Size}_{actual} > \mathrm{Size}_{expected} \\ \frac{\mathrm{Size}_{actual}}{\mathrm{Size}_{expected}} & \text{otherwise} \end{cases}    (5)

where Size_actual is the area of the component in pixels and Size_expected is determined using a model of a person's head at the component's distance. The model is shown in Figure 6.

Figure 6: A simple model of a human including the dimensions of the head, shoulders, arms and overall height.

The last term we refer to as the distance penalty, and we define it as:

D_z(i) = \begin{cases} 0 & \text{if } 3.0\,\mathrm{m} \le z \le 5.0\,\mathrm{m} \\ \alpha z & \text{otherwise} \end{cases}    (6)

where z is the component's distance from the camera and α is set to -0.5, so we penalize a component the further away from the camera it is.
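A compact sketch of the scoring of Equations 3-6, as reconstructed above, is given below; the weights w1-w3 and the confidence function F_i are placeholders, since the paper determines the weights experimentally and does not publish their values.

# Sketch of the component score of Equations 3-6 as reconstructed above.
# The weights w1..w3 and the confidence value f_i are placeholders; the
# paper sets the weights experimentally and does not publish them.
import numpy as np

ALPHA = -0.5                      # distance-penalty slope from the paper
W1, W2, W3 = 1.0, 1.0, 1.0        # assumed weights

def template_match_cost(face_region: np.ndarray, template: np.ndarray) -> float:
    """Equation 4: correlation of template g with region f, normalized by sum(f^2).
    face_region is assumed already scaled to the template's size (see the text)."""
    g = template.astype(np.float64)
    f = face_region.astype(np.float64)
    return float((g * f).sum() / (f * f).sum())

def size_attribute(size_actual: float, size_expected: float) -> float:
    """Equation 5: ratio of actual to expected head area, always <= 1."""
    if size_actual > size_expected:
        return size_expected / size_actual
    return size_actual / size_expected

def distance_penalty(z: float) -> float:
    """Equation 6: no penalty in the 3-5 m band, alpha*z outside it."""
    return 0.0 if 3.0 <= z <= 5.0 else ALPHA * z

def component_score(face_region, template, size_actual, size_expected, z, f_i=1.0):
    """Equation 3: F_i * (w1*S_a + w2*S_t + w3*D_z)."""
    s_a = size_attribute(size_actual, size_expected)
    s_t = template_match_cost(face_region, template)
    return f_i * (W1 * s_a + W2 * s_t + W3 * distance_penalty(z))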
Figure 7 shows an example of finding the user. Notice that the user need not be standing. Our approach can distinguish between the face and the two hand components even though they have similar sizes.
4.4 User Head/Gesture Tracking
Once we identify the user's head location in the image, we keep this location current by tracking the head over time. We also need to be able to tell where the user is pointing for Eric to move there. In the following two sections, we describe how we solve these two problems.
4.4.1 Head Tracking
We employ a head tracker very similar to the one presented by S. Birchfield in [2]. Birchfield's is a contour-based tracker with a simple prediction scheme assuming constant velocity. The contour tracked is the ellipse a person's head traces in the edge image, [8].

We make one addition to the algorithm to increase robustness, by taking advantage of the stereo information available at no significant computational cost. We begin by selecting the skin-colored pixels in a small region around each hypothesized head location. Then, we compute the mean depth of these pixels and we use this value to depth-segment the image using our stereo data. This allows us to robustly remove the background pixels. The tracking algorithm can run in real time (30 fps), but we limit it to 5 fps for computational reasons, especially since our experiments reveal that this is sufficiently fast.
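A sketch of the depth-segmentation addition is given below; the ±0.5 m depth tolerance is an assumed value.

# Sketch of the stereo addition to the head tracker: keep only pixels whose
# depth is close to the mean depth of the skin pixels around the current
# head hypothesis. The +/- 0.5 m tolerance is an assumed value.
import numpy as np

def depth_segment(gray: np.ndarray, depth: np.ndarray,
                  skin_mask: np.ndarray, head_box, tol_m: float = 0.5):
    """Return the gray image with background (far-from-head) pixels zeroed."""
    x0, y0, x1, y1 = head_box                      # region around the hypothesis
    roi_skin = skin_mask[y0:y1, x0:x1] > 0
    roi_depth = depth[y0:y1, x0:x1]
    valid = roi_skin & np.isfinite(roi_depth) & (roi_depth > 0)
    if not valid.any():
        return gray                                # nothing to segment on
    mean_d = roi_depth[valid].mean()
    keep = np.abs(depth - mean_d) <= tol_m
    return np.where(keep, gray, 0).astype(gray.dtype)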
Figure 7: Example of finding the user in images. (a) The color image, (b) the skin regions, (c) the connected components and (d) the Triclops image annotated with the location of the user's head shown with a square.
Figure 8: Example of gesture tracking showing two of the steps in the algorithm: (a) the selected foreground pixels in the gray-scale image and (b) the final result showing the direction selected, and the location of the user's head and shoulders.
4.4.2 Gesture Recognition
The user points to a location on the floor and instructs Eric to move there. Eric computes this location and reports the result, waiting for the user's confirmation.

The process is guided by the already known location of the user's head, taking advantage of color and stereo information along with our model of the body of a human (see Figure 6). We proceed as follows (a sketch of steps 7-11 appears after the list):

1. Get the gray, color and stereo images, denoted I_g, I_c and I_s respectively. These are the input data, along with the last known location of the user's head, l_head.

2. Select the skin pixels in the neighborhood of l_head and generate the set of skin pixels N.

3. Compute the average depth, d_N, of the pixels in N.

4. Segment the color and gray images using d_N as a guide, Figure 8 (a).

5. Select the skin pixels in the segmented image, forming the set of foreground pixels M.

6. Compute the connected components in M.

7. Select the component closest to the last known head location as the face.

8. Classify the remaining components as being to the left or right of the face component. Construct the sets L, R respectively.

9. Remove the components from L, R that are too far from or too close to the face component, using the model in Figure 6 as a guide.

10. From the components remaining, select the dominant direction using a heuristic based on the angle that each arm makes with respect to the person's body. Define a vector from the shoulder to the hand. This is the direction, f, the user is pointing, Figure 8 (b).

11. Compute the point on the floor using direction f and return.

If Eric fails to detect any hands, most often because of failure during the skin detection, he reports this to the user.
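The sketch below outlines steps 7-11 only, assuming the components have already been found (as in the face-finding code earlier) and converted to 3D points in a robot frame with x forward, y left, z up and the floor at z = 0; the shoulder offset and the dominant-arm heuristic are simplified stand-ins for the body model of Figure 6, and all thresholds are assumptions.

# Schematic sketch of steps 7-11 of the gesture algorithm. It assumes the
# face/hand components are already available as 3D points in a robot frame:
# x forward, y left, z up, floor at z = 0. The shoulder offset and the
# angle heuristic are simplified stand-ins for the body model of Figure 6.
import numpy as np

SHOULDER_DROP_M = 0.25   # assumed head-to-shoulder vertical offset

def pointing_target(head_xyz, hand_candidates_xyz):
    """Return the floor point (x, y, 0) the dominant arm points at, or None."""
    head = np.asarray(head_xyz, dtype=float)
    best = None
    best_angle = 0.0
    for hand in map(np.asarray, hand_candidates_xyz):
        shoulder = head - np.array([0.0, 0.0, SHOULDER_DROP_M])
        f = hand - shoulder                      # step 10: shoulder-to-hand vector
        # Heuristic: the dominant arm is the one extended furthest from the
        # vertical body axis.
        angle = np.degrees(np.arccos(abs(f[2]) / (np.linalg.norm(f) + 1e-9)))
        if angle > best_angle:
            best_angle, best = angle, (shoulder, f)
    if best is None:
        return None                              # no hands detected: report failure
    shoulder, f = best
    if f[2] >= 0:                                # arm not pointing toward the floor
        return None
    t = -shoulder[2] / f[2]                      # step 11: intersect ray with floor
    return tuple(shoulder + t * f)

if __name__ == "__main__":
    # Head 1.6 m above the floor, one hand candidate half a meter ahead.
    print(pointing_target((0.0, 0.0, 1.6), [(0.5, -0.2, 1.0)]))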
5 Example of HRI

Eric can only understand a small number of spoken sentences. He also has a limited vocabulary, i.e., he can only speak a limited number of sentences. We have kept the number of possible sentences that Eric can recognize and speak small in order to speed up speech processing and also to keep his interaction with people a straightforward process. The dialog mostly consists of the user giving commands to Eric and Eric responding by confirming the heard command and/or by reporting the results of a computation. In general, two-way communication may proceed as follows:

Person: (loudly) Eric!
Eric:   (stops motion and turns towards the person)
Person: Find me!
Eric:   I can see you. You are 4 meters away and 1.6 meters high. Is this correct?
Person: Yes!
Eric:   Confirmed!
Person: (Points with finger) Go there!
Eric:   You are pointing 2 meters ahead and 4 meters to my left. Do you want me to go there?
Person: Yes!
Eric:   I am on my way! (Eric moves to that location and enters exploration mode)

This is only one instance of the possible dialogs a user can have with Eric. At any time during the interaction, the user can simply instruct the robot to return to exploration. Additionally, the user may request status information about Eric's mechanical status. Eric might also fail in one of the tasks of finding and tracking the user. In that case, Eric reports this and waits for new instructions.
6 Conclusions

In this paper we presented a distributed system for natural human-robot interaction. It is designed to perform in real time by distributing the workload among several workstations connected over a local area network. People can communicate with the robot using speech and hand gestures to direct it to specific locations to explore.

We are currently working on improving the interaction process by allowing more sentences to be understood and spoken by the robot. We are also adding to the number of gestures that can be understood and, finally, more software for modeling the user's emotional state by analyzing his facial expressions.
Acknowledgments
Special thanks to Jesse Hoey for reviewing this paper and Don Murray for his help with programming the robot.
References
[1] R. C. Arkin. Behavior-Based Robotics. Intelligent Robots and Autonomous Agents. The MIT Press, 1998.
[2] S. Birchfield. An elliptical head tracker. 31st Asilomar Conference on Signals, Systems and Computers, pages 1710-1714, November 1997.
[3] R. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2(1):14-23, 1986.
[4] P. Elinas, J. Hoey, D. Lahey, D. J. Montgomery, D. Murray, S. Se, and J. J. Little. Waiting with José, a vision-based mobile robot. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA'02), Washington D.C., May 2002.
[5] B. Funt, K. Barnard, and L. Martin. Is machine colour constancy good enough? 5th European Conference on Computer Vision, pages 445-459, 1998.
[6] B. Hayes-Roth. A blackboard architecture for control. Artificial Intelligence, 26:251-321, 1985.
[7] R. Jain, R. Kasturi, and B. Schunck. Machine Vision. MIT Press and McGraw-Hill, Inc., 1995.
[8] D. Marr and E. C. Hildreth. Theory of edge detection. Proc. R. Soc. Lond. B, 207:187-217, 1980.
[9] D. Murray and C. Jennings. Stereo vision based mapping and navigation for mobile robots. In Proc. ICRA, Albuquerque, NM, 1997.
[10] D. Murray and J. Little. Using real-time stereo vision for mobile robot navigation. In Proceedings of the IEEE Workshop on Perception for Mobile Agents, Santa Barbara, CA, June 1998.
[11] G. L. Reid and E. Milios. Active stereo sound localization. Technical Report CS-1999-09, York University, 4700 Keele Street, North York, Ontario M3J 1P3, Canada, December 1999.
[12] S. Se, D. Lowe, and J. Little. Vision-based mobile robot localization and mapping using scale-invariant features. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 2051-2058, Seoul, Korea, May 2001.
[13] K. Sobottka and I. Pitas. Segmentation and tracking of faces in color images. In Proc. of the Second International Conference on Automatic Face and Gesture Recognition, pages 236-241, 1996.
[14] J.-C. Terrillon and S. Akamatsu. Comparative performance of different chrominance spaces for color segmentation and detection of human faces in complex scene images. Vision Interface, pages 1821-1828, 1999.
[15] J. K. Tsotsos. Intelligent control for perceptually attentive agents: The S* proposal. Robotics and Autonomous Systems, 21(1):5-21, January 1997.