>> Zhengyou Zhang: Okay. So it's my great pleasure to introduce Professor John Tsotsos.
John is a professor at York University in Canada, and he's one of the very few people who really work across many areas in cognitive vision, computational vision and computer vision. And today he will talk about a topic which I think is a very interesting combination of computation and computer vision. Thank you.
>> John K. Tsotsos: Thank you so much. It's a pleasure to be here. So I'm going to talk about -- basically it's a visual search talk today, and if you want to categorize it, it's in the active vision domain for a mobile robot. So we'll start off with a little bit of motivation as to why we're looking at the problem and why we're looking at it from this particular angle that you'll see. I'll talk a lot about active object search, and I'll give you some background on work we've done in the past and then a lot of detail on how exactly we're solving the problem currently. I'll show you a number of examples and a couple of experiments that give a little bit of an evaluation of the method; that will also include comparison to other techniques for doing similar things. And then we'll conclude with basically giving you an update on the overall project, because the active search is only part of an overall project.
So let's begin with the motivation. And the motivation initially was we wanted to build a wheelchair for the disabled, an autonomous wheelchair that's visually guided for the disabled. So in those cases the dream is really to provide mobility for those people who don't have mobility or have some restrictions in terms of their motor systems.
Now, it's interesting that a lot of things that we're used to were predicted by Gene Roddenberry in Star Trek, but not this. So if you remember the episode The Menagerie with Captain Pike, he was in a wheelchair but he was pushed around everywhere. He didn't have his own mobility. So it's curious to me that Gene Roddenberry didn't foresee the fact that someday these chairs would be autonomous.
The first of the papers to appear in this area is from 1986, and there's a little picture in the corner of what that wheelchair looked like. Basically it was a self-navigating chair that had a number of sensors on it, and the main goal was mobility, in a crowded building and so forth. And that's really it.
That still remains the major goal of all of the groups that do this sort of research. But since then there have been a lot of different sorts of wheelchairs and I've got just some pictures of a number of them. And you know, you could just briefly take a look at them.
They span many years. And they kind of all look the same, except, you know, down in the corner here Dean Kamen's iBOT is a little bit different and has some unique elements. Even some of the more modern ones have, you know, some sexier design. Really all we can say is that even this Suzuki chair looks a lot like what Captain Christopher Pike had in Star Trek: he's just sitting in it and it just provides mobility.
Our motivation is a little bit different. And it really started when I was watching television on a Saturday morning in 1991. I was babysitting my son, who was 18 months old at the time, and I was looking at the Canadian version of your PBS network. And it was featuring a roboticist, Gary Birch, from the Neil Squire Foundation.
And I didn't know him, so I thought let me watch. And my son was playing happily in front of -- you know, right below the television.
So what he had shown is a system that he had developed for disabled children to be able to play with toys. So imagine the following sort of scenario. There was a little boy in his wheelchair wearing a bicycle helmet. In front of him was a table, and the table had a number of toys on it. Above the table was a gantry robot with an arm and manipulator. The arm's joints were color coded. Beside the child's head was a paddle wheel where the paddles were color coded the same colors as the joints.
So now what the child would do is, with his head, hit paddles of the right color, and every time he would hit a given color, one of the joints would move one step. So after doing this a lot of times, he managed to move the gantry robot arm down to grasp a toy, a single toy.
So I was watching this and my healthy son was playing on the floor below, and, you know, it kind of got to me, and I thought: this is my area, we should be able to do better than this. So that's really when the project started back then. It was initially easy to fund because people thought, wow, this is great, a lot of social and economic value. And then as we got more developed it was difficult to fund because people thought, well, who's going to pay for such a thing? It's really expensive and stuff like that. So the funding part has always been really difficult. And that's why it's taken a long time to make a lot of progress.
So our initial focus was not on navigation, it was really on how to ease the task of instructing the robot and having the robot deal with a search-and-grasp kind of task. In other words, to short circuit this tedium that the child I saw in that wheelchair has to face in order to grasp something. If we could only find an easier way to instruct the robot, and then have the robot take over and visually find everything and do all the planning and so forth, it would be easier.
So our focus is very different from most groups, which focus only on navigation. Add in the fact that my lab has traditionally looked at vision only, and we wind up with a purely vision-based robot. So there is no sonar, no laser on it and so forth. And I'll show you that towards the end of the talk.
Obviously this leads us to an active approach because this robot has to move around.
So it's not -- we're not looking at static images and we need to be able to control acquisitions. So if you're faced with a task of how do I find something in a room like a toy, it's an active approach. So there's lots of reasons why active approaches are valuable, and I just give a big long list of them here, things that I have written in my papers over the past and hopefully trying to convince you that there is really a lot of room for active vision in the world that is different from the kinds of, you know, single image static approaches that one sees most commonly in computer vision.
The work we've done in the past in active vision is pretty interesting in my mind, and it started with a PhD student of mine named David Wilkes. I'll show you an example of his work. It then moved to some work that I did when Sven Dickinson was my post-doc at the University of Toronto, and with Henrik Christensen, who was at KTH and is now at Georgia Tech, and then with another PhD student, Yiming Ye, and a master's student, Sonia Sabena [phonetic], and that will be the focus of today's talk.
But I'll just give you these first couple of examples. So this is from Sven Dickinson's work. And basically those of you who know what Sven has been doing, it's been focusing on object recognition over the years. And he initially looked at having an aspect graph representation of objects which is shown up toward -- on the left side.
And he thought, when he was working with me, that suppose we use the links on the aspect graph to encode information about which directions to move the camera in order to acquire different views of the object.
So in particular, if you have a degenerate view such as this one here of the cube, and you try to match it to an aspect graph representation, and you match a particular face on that aspect graph, looking at the links might tell you in which directions to move the camera to disambiguate a cube from an elongated structure, okay? So here is the coding that you see. I'm not going to go through the details of the coding; it's not interesting. Suffice it to say that for each of the links there would be some information that would give you the next viewpoint, and sure enough you can then move to the least ambiguous aspect of them all just by looking at the different links and comparing them in order to do the disambiguation. Sure?
>>: I guess [inaudible].
>> John K. Tsotsos: Yes?
>>: You mentioned that the, you know, the reason for doing all these is to help people
[inaudible].
>> John K. Tsotsos: Yes.
>>: I'm just kind of wondering how you expect [inaudible] to be able to specify what he or she wants.
>> John K. Tsotsos: Okay. I'll answer -- okay. I'll answer that now so that everybody
-- if anyone has the same question. So what we have right now on our wheelchair is a small touch pad, a touch-sensitive screen, and we're assuming that the user is able to at least move their finger to touch things. So the kind of patient that I was thinking of is perhaps a cerebral palsy child who is accustomed to communicating using the Bliss language.
The Bliss language is a symbolic language, and typically, like, on a table in front of them they have a board with a number of symbols, usually paper or plastic symbols, and they point -- they have, you know, coarse control over their arm -- and they point to these symbols in a sequence in order to communicate. So our touch-sensitive screen is an extension of this Bliss language. It includes the Bliss symbols but also includes pictures of the toys, little video clips of actions, pictures of colors and so forth.
So that you can touch a sequence of those and basically create a short sentence with five or six touches that gives the instruction to the robot of what you want. Pick up block, put down here or something like that. Does that answer? Okay.
The second bit of work I wanted to show you is what David Wilkes had done, and his project was to try and recognize origami objects out of a jumble of them, so you would take a bunch of these origami objects and sort of drop them on a table and you want to
be able to recognize them. And he had created an internal representation of these that was basically a wireframe, but where you had the faces of these objects specially labeled as faces that you could use to disambiguate one object from another. So if you think of it as a hypothesize-and-test kind of framework, trying to index into a database of objects, you would extract information about a face of an object -- in other words, one of those wings, let's say -- and use that as an index into the space of objects, come up with the number of hits that would be present there, and then use other faces that are represented in each object as hypotheses to go test, to help disambiguate amongst that smaller set, and then you continue that loop until you come down to a single one.
So this is a little video of the system doing exactly that. So I'll remind you, this is 1994, when this was created, and it's on a TRW platform with a CRS robot arm. The gripper has a camera and light source in it. And the kinds of motions that you see it executing are all controlled by a behavior-based system with three different kinds of motions. So it turns out that you have a sufficient number of motions to do this if you allow for motion on a hemisphere with the objects at the center. So you can go radially into that hemisphere, you can go tangentially, or you can rotate.
And those three motions, if structured in the right way, are sufficient to be able to capture the faces that you need to recognize and to move to other views of objects in order to be able to do that disambiguation. It also is able to recognize when it reaches the limit of its reach, so it can move the platform in order to extend the reach, in order to go around that whole hemisphere over the object. Uh-huh?
>>: [inaudible].
>> John K. Tsotsos: Yes. So it's a word that appears in medicine. And basically it refers to a symptom or a sign which makes it clear that it's a particular disease. So here you want -- so in other words, if you find this, then it's definitely that disease. Okay? So the search here is for viewpoints that tell you for certain that it's a particular object. So in other words, you have all possible viewpoints when you just toss the objects down in a jumble, they could be in any kind of pose, but you're looking for the camera pose, camera object viewpoint that will allow you to determine that it's a particular object with certainty.
>>: So you're basically looking for unique view of the object?
>> John K. Tsotsos: It's not necessarily unique. It could -- there could be overlaps.
But you're looking for unique view that would distinguish one object from another in that hypothesize and test framework.
>>: [inaudible].
>> John K. Tsotsos: Yes. Yes, it does that. So there's some past work that we've done on which we've built. And I just have to say this -- this is kind of, you know, more political than anything -- trying to sell you on the idea of attention or active vision just by pointing out that an awful lot of work that one sees in computer vision these days tries to set up circumstances so that one doesn't need to have attention or active vision. And I think that when you're looking at real world applications, most of the assumptions are not valid. So things like always having only fixed cameras: in our application, where we have a mobile platform, that's not valid. Taking images out of spatiotemporal context so we don't need to track: again, for us, it's not a valid assumption, and so forth.
So there are a lot of situations where most of the current assumptions that people make
I think are not valid. So let's move on to the main meat of the talk, which is looking at 3D search. This was initially the PhD thesis of my student Yiming Ye, to whom I dedicate this because he passed away a few years ago. So let me formulate the problem for you first. The problem that we're looking at is how to select a sequence of actions on the robot that will maximize the probability that a robot with active sensing will find a given object in a partially unknown 3D environment within a given set of resources.
Okay? So that's our definition of the search problem. And to formalize it further, what we're looking for is the set of actions, out of the full set of all possible actions, that satisfies two constraints: one being that the total time it takes is below some threshold -- in other words, we're not going to wait forever, we're going to have a limit on the amount of time -- and the other that it maximizes the probability of finding that particular object that we're looking for.
Under this kind of definition, which is quite general, I think, the problem is provably NP-hard. Yiming had done those proofs and you can see them in some of the older papers. So given that the problem is NP-hard, you know you're not going to have an optimal solution, so one looks for approximate solutions in order to get as close as possible to that optimal solution. So we're going to look at some heuristics that will help us in dealing with the exponential behavior of the generic solution.
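A minimal sketch of that objective in Python, assuming (hypothetically) that each candidate action carries a time cost and a detection probability and that detections are independent; the exhaustive enumeration over action subsets shows why the exact problem blows up exponentially and why heuristics are needed:

    from itertools import combinations

    def detection_prob(action_subset):
        # P(find target) = 1 - product of miss probabilities, assuming independence
        miss = 1.0
        for a in action_subset:
            miss *= 1.0 - a["p_detect"]
        return 1.0 - miss

    def best_action_set(actions, time_budget):
        best, best_p = (), 0.0
        for k in range(1, len(actions) + 1):
            for subset in combinations(actions, k):   # exponential in |actions|
                if sum(a["cost"] for a in subset) <= time_budget:
                    p = detection_prob(subset)
                    if p > best_p:
                        best, best_p = subset, p
        return best, best_p

    # toy numbers, purely illustrative
    actions = [{"p_detect": 0.2, "cost": 3.0},
               {"p_detect": 0.5, "cost": 7.0},
               {"p_detect": 0.1, "cost": 1.0}]
    print(best_action_set(actions, time_budget=8.0))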
So there are a number of other variables that are of interest. We have to deal with the fact that the robot has an XY location and an orientation on the plane. We have cameras, and they pan and tilt, and they have a depth of field. The room has a certain size, so this is all we know about whatever space we're searching.
We know the length, width, and height of the room; we only know where the external walls are. And the kinds of observations that we want this robot to make are two only. One is whether there's an obstacle at a given position, and the second is whether the target that I'm looking for is at a given position. So those are the variables.
What we do first is think about how we can partition the room, and partition the space of actions as well, in order to make it reasonable, so one doesn't have to look at all possible locations of the robot, all possible angles of view and so forth. So we're going to have an occupancy grid representation: we're going to fill this room with little cubes, and in each of those little cubes we'll have a couple of variables represented. One is the probability that the object is centered at that cube, and the other is whether or not that cube is solid in the world. Okay? And we then will partition all the possible viewing directions. So rather than having the robot need to consider every possible angle degree by degree, we'll partition it into certain viewing angles to make that a smaller number. That means that if I'm the robot and I'm looking around in the world, I've got these three dimensional wedges that I see out into the world, and that creates the structure that we call the sensed sphere. So if you look on the outside of that sphere, everything is at a particular depth of field and it's this little wedge, so it has all of those images that it can look at. So it reduces the number of those images tremendously.
Because there's a depth of field, and I'll explain this a little bit more later on, it really is an onion skin. So you're not looking at all possible depths, you're looking only at a particular depth of field that the camera uses. Now, why is that relevant? Because for every object that you want to see, it's not the case that a robot will be able to recognize it from any position in the room.
If it's too far away, there aren't enough pixels, and if it's too close, it's too big, so you can't recognize it. So you need to have it within a particular depth of field which is tuned, sort of dynamically, for every object that you would like to recognize.
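A small data-structure sketch of the occupancy grid and the discretized viewing directions just described, assuming 5 cm cubes; the room dimensions, angle steps, and depth-of-field numbers here are illustrative stand-ins, not the actual values used on the robot:

    import numpy as np

    CUBE = 0.05                                      # 5 cm voxels
    room = (8.0, 6.0, 2.5)                           # hypothetical room size in meters
    shape = tuple(int(d / CUBE) for d in room)

    p_target = np.full(shape, 1.0 / np.prod(shape))  # uniform prior, sums to 1
    occupied = np.zeros(shape, dtype=bool)           # "solid" cubes from stereo hits

    # Discretized camera directions instead of sweeping degree by degree.
    pan_angles = np.deg2rad(np.arange(0, 360, 30))   # e.g. 12 pan steps
    tilt_angles = np.deg2rad([0, 30])                # horizontal, and tilted up

    # Depth of field tuned per object: too far means too few pixels, too close too big.
    depth_of_field = {"toy_block": (0.5, 2.5), "small_cup": (0.4, 1.5)}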
So we've restructured the representation of the world, and then we're going to associate with each one of those actions the probability that we've actually found that object. So that action depends on the robot's position and the current sensor settings, and for any set of actions the probability of detecting the target is given by this, which is kind of a standard Bayesian sort of framework. The conditional probability of detecting the target, given that it is centered at a particular cube, is given by b, and this is determined at this point entirely empirically. And that's one of the dimensions of future work: to get that to be learned specifically for each particular kind of object.
So it depends, that conditional probability depends on XY location relative to the location of the cube that we're particularly testing, the action that's chosen, the viewing direction and so forth. And as I said, this is determined experimentally for particular objects and future work is going to look at how one learns it in the more generic case.
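A hedged sketch of the per-action detection probability described above: sum over cubes of the prior mass times the empirically measured conditional detection probability, which is zero outside the action's wedge. The constant 0.9 and the simple geometric test are placeholders for the empirical function, and the names are mine:

    import numpy as np

    def b_empirical(cube_center, action):
        # Placeholder for the empirically measured b(cube, action): zero outside
        # the depth of field / field of view, a constant inside.
        rel = cube_center - action["cam_pos"]
        dist = np.linalg.norm(rel)
        near, far = action["depth_of_field"]
        if not (near <= dist <= far):
            return 0.0
        cos_angle = np.dot(rel / dist, action["view_dir"])
        return 0.9 if cos_angle >= np.cos(action["half_fov"]) else 0.0

    def detect_prob(action, p_target, cube_centers):
        # P(detect | action) = sum over cubes of p(cube) * b(cube, action)
        return sum(p * b_empirical(c, action)
                   for p, c in zip(p_target, cube_centers))

    cubes = np.array([[1.0, 0.0, 0.5], [3.0, 0.0, 0.5]])
    p = np.array([0.5, 0.5])
    a = {"cam_pos": np.zeros(3), "view_dir": np.array([1.0, 0.0, 0.0]),
         "depth_of_field": (0.5, 2.5), "half_fov": np.deg2rad(30)}
    print(detect_prob(a, p, cubes))   # only the first cube falls inside the wedge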
The reason why we didn't focus on that, just like we didn't focus on having fancy path planning for the robot and so forth, is because we wanted to focus on the actual search method: how do you determine where to look? So we wanted to test that part without spending a lot of time on the balance. Once that part is working with the default situation, we go back and start adding some of these other components. This is something that I kind of worried about for a while, and about a year ago [inaudible] visited, and I showed him the demos here, and I explained to him that, well, we don't have this and we don't have that because we wanted to focus on the active part. And he thought for a second and goes, yeah, you're right, I think I agree with you. A lot of the groups that focus on path planning get stuck at path planning and never leave path planning to do all the harder stuff. So you've done that part first. So I felt kind of better about that. But still it's a weakness of the overall system.
What we have is a greedy strategy. We divide the selection of these actions into where-to-look-next and where-to-move-next stages. So the actions that we can do are: move the robot in XY along the plane of the room; move the camera, so where can you move the camera to; and third, actually take some measurements. So we first decide where to look next, and then where to move next.
And there's no look ahead in this system. So where to look next algorithm looks something like this.
In the default case, let's say we're in this room, and the room is divided into all of these subcubes. Just for reference, when we did our experiments the cubes are five centimeters by five centimeters by five centimeters in the examples that you'll see later.
Each one is assigned a probability that the object is present there; given that we have no information about the object at the moment nor its location, everything is equally probable, and all of it sums to one.
So first of all, we calculate some total probability of a region in space that we are hypothesizing to look at. If everything is equally probable, then every direction looks the same. So you just choose one. You look in that particular direction, so you have that wedge if you remember. You can test whether or not the object is present there. If it's not present, you can set those cubes to probability zero and update everything else. So after you do that a few times, you run out of places to look from a particular position. So I look here, I can look here, I can look here, I can look here, and then I run out of potential options for where to look if I can't find the object. And then I would decide that, well, maybe I need to move to some other position.
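A minimal Python sketch of that where-to-look-next loop at a single robot position, assuming a flat numpy array of cube probabilities; wedge_cubes(), detector(), and the utility threshold are hypothetical stand-ins for the real components:

    import numpy as np

    def look_around(position, directions, p_target, wedge_cubes, detector,
                    utility_threshold=1e-3):
        tried = set()
        while True:
            # Score every untried direction by the probability mass inside its wedge.
            scores = {d: p_target[wedge_cubes(position, d)].sum()
                      for d in directions if d not in tried}
            if not scores:
                return None                    # nothing left to try from here
            best = max(scores, key=scores.get)
            if scores[best] < utility_threshold:
                return None                    # no view is worth it: time to move
            tried.add(best)
            if detector(position, best):       # run recognition on this view
                return best                    # target found in this direction
            # Not found: eliminate the examined cubes and renormalize the rest.
            p_target[wedge_cubes(position, best)] = 0.0
            p_target /= p_target.sum()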
And when you see this working, it's kind of like the way a little kid would do it. Because if you asked a little kid to find something in the room, they sort of walk around and they'll, you know, do this and you know, walk around some more and do this and so forth and that's exactly how the system behaves, also.
Where to move next is then determined in a different way. So if I'm standing here and I've exhausted all my viewing possibilities from this location, I'm now going to ask myself, well, where would be the next best location for me to go. We don't have any kind of path planning here, so right now what the robot would do is just look for free floor space. If you had path planning, you don't have that restriction. So you would hypothesize.
Suppose I move over there. If I'm there, what would -- what is the set of viewing angles that I have available to me at that position? And then what is the sum of all the probabilities of the cubes that I have access to at that position? And I compute that for all of my accessible positions and choose the one where the sum of the total probabilities is largest. Well, that would be the one where I have the strongest probability that I would find my object. And I move there, and then I go through the whole process again. I look at all the viewing angles from that direction and then repeat until I find the object. Yes?
>>: Actually [inaudible] for example you know, you see something close to you and so I go make a conscious effort to move sideways [inaudible].
>> John K. Tsotsos: No. At the moment there is no planning of that sort of -- no path planning of that sort of thing nor is there viewpoint planning of that sort. The assumption that's made in the examples that I'll show you is that there is a free floor position in the room from which I will see one of the views of the object that I know.
That's a limitation, clearly. Adding in planning removes that.
But we didn't do that because we wanted to make sure that we were able to handle, you know, this sort of case where you know planning didn't exist even. So that's kind of the default situation, but you're absolutely right.
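Going back to the where-to-move-next step: a minimal sketch of scoring candidate free-floor tiles by the probability mass visible from them, assuming straight-line reachability as just described; reachable(), visible_cubes(), and the minimum-utility threshold are hypothetical stand-ins:

    import numpy as np

    def next_position(current, floor_tiles, p_target, visible_cubes, reachable,
                      min_utility=1e-3):
        best_pos, best_score = None, 0.0
        for tile in floor_tiles:
            if tile == current or not reachable(current, tile):
                continue
            # Sum of probabilities over all cubes this tile's sensed sphere could see.
            score = p_target[visible_cubes(tile)].sum()
            if score > best_score:
                best_pos, best_score = tile, score
        # Refuse to move anywhere that doesn't promise enough remaining probability.
        return best_pos if best_score >= min_utility else None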
>>: [inaudible].
>> John K. Tsotsos: Yeah. You'll see that clearly in the examples. So you can easily imagine that as you move about the room all those probabilities are changing all the time. And it actually initially looks pretty simple but by the end of a -- you know, two or
three positions it's actually pretty complex. Very difficult to characterize in any formal way.
>>: [inaudible]. Just starting, so that's like uniform [inaudible].
>> John K. Tsotsos: Yes.
>>: And then it's evolving as --
>> John K. Tsotsos: Correct. Correct. So we've actually taken this and implemented it three separate times. The first implementation was what Yiming had done for his PhD thesis on the Cybermotion platform. This was the mid 1990s. And the whole algorithm worked well. Different sorts of sensors. And from back then we have no movies, we didn't take pictures, people just didn't do that back then. So I don't have very many examples.
The examples that you'll see running are on the Pioneer 3 platform; Sonia Sabena [phonetic] implemented all of this. And it uses the Point Grey Bumblebee with their standard stereo software and a pan-tilt unit from Directed Perception. So that's all.
This is the wheelchair. The system has also been ported to the wheelchair and runs on it. It can go find an object. It even finds doors, opens them, and passes through them autonomously, all of that. So this is the touch-sensitive display that I was referring to earlier, with which users can point to give their instructions.
One thing that I didn't say is how exactly we determine whether or not you've found the object. So you're making observations; I said two observations. One is: is that position solid? That's where stereo comes in. If you get a stereo hit from the Triclops algorithm, then that cube is marked filled. And the other is whether or not the target that you're looking for is at that position. There are lots of different kinds of object detection algorithms. We defined our own, which works actually pretty well for the kinds of objects we were initially looking at, namely solid colored objects, like blocks, kids' blocks, that sort of thing. So it actually works very well. And it's something that James MacLean did when he was a post-doc in my lab.
And it's based on the selective tuning algorithm of attention that I've been working on forever. It's basically a pyramid tuned to objects and uses gradient descent to locate things within a pyramid representation. It works pretty well. It can handle objects as long as they have a distinguishing 2D surface. It seems to be pretty invariant to rotation in the plane and to scale, handles some rotation in depth as well, and uses support vector machines to determine some acceptance thresholds. The search algorithm is not dependent on this. As you'll see, in one of the other experiments that we did, we used SIFT features in order to detect an object, because this method doesn't work well for textured objects, and SIFT didn't work well for solid colored objects, so we use whichever fits. And basically the overall algorithm could have a family of different detection methods and just choose which is the right one to use for the particular object that you're looking for.
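A tiny sketch of that point, that the search algorithm is detector-agnostic: a registry maps each target's appearance type to whichever recognizer suits it. The function bodies are placeholders, not the actual recognizers:

    def pyramid_detector(image, target):
        # Stand-in for the selective-tuning pyramid detector for solid-colored objects.
        return False   # placeholder: always reports "not found" in this sketch

    def sift_detector(image, target):
        # Stand-in for a SIFT-based detector for textured objects.
        return False   # placeholder

    DETECTORS = {"solid_color": pyramid_detector, "textured": sift_detector}

    def detect(image, target):
        # Pick the right recognizer for this particular object, then run it.
        return DETECTORS[target["appearance"]](image, target)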
So this is necessary but it's not -- it didn't really define the performance of the method.
Also, what we haven't done, and which is future work, is that we need to worry a little bit more about relaxing that assumption that I mentioned earlier, namely that we're assuming there is some spot on the floor where the robot will find a view of the object that it recognizes. We need to relax that because that's not generally true in the real world. So we need to be able to include viewpoints, image sizes, scale, rotation in 3D and so on and so forth. One way we can do that a little bit is by simply doing some experimental work and seeing how the recognizer that we have functions when there is rotation in depth or occlusion. So these are just a couple of examples of the kind of recognition performance that you get as you increase the percentage of the target being occluded or as you increase the degrees of rotation in depth.
And this is actually not so bad, because you see for occlusion it's not very tolerant, but for rotation in depth it does a reasonable job up to about 20 degrees. So this gives you at least a little bit of flexibility. But it's certainly not ideal. And there are just some other examples of this. And we actually can build this into the system in terms of the tolerance that we're willing to take, you know, whether it be here or here, in order to decide which one of the recognizers to use.
In fact, what we need to do is have a recognizer that will be able to deal with the full viewing sphere around the particular object. So we can have a little bit of a buffer zone due to the performance that I just showed you, but in general we need to be able to determine that I've actually looked at a particular position from all possible angles that would let me find the object before deciding that I have not found the object. As opposed to what I told you, namely: I took this view, I ran my recognizer, I didn't find it, I blanked everything out. That assumption also has to be relaxed. This is future work.
We can put in a priori search knowledge, because we also can relax the assumption that everything is equally probable. So there are a number of different things that we can do here. Type one a priori knowledge is exactly that: everything is equally probable. We can include indirect search knowledge. This is an idea from Tom Garvey from 1976. Lambert Wixson, when he was doing his PhD with Dana Ballard, had implemented a system that had this indirect search. Basically, if you can find an intermediate object more easily than the target, you find that first and then find the target object. So if you're looking for a pen and that's hard to find, well, pens are often on desks, so you find the desk first and then go find the pen.
You could highlight regions in which to try first. You know, I think I left my keys over there. So you just put that into the system and have the -- prefer that position first. You can add saliency knowledge. I'm looking for a red thing and I'm looking for this particular image. I can't find the object I want but there's a red blob over there. So maybe for the next image I want to go closer and inspect that particular area. Or I could have predictions in there about spatial structure or temporal structure and these again are older ideas, one from Kelly, one from my own work.
In other words, we can't make simplifying assumptions about the probability distributions.
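A hedged sketch of turning such a priori cues into a non-uniform prior: cubes inside regions believed likely (say, the tops of tables) get extra mass before renormalizing. The grid size, region, and boost factor are made up for illustration:

    import numpy as np

    def prior_with_regions(shape, likely_masks, boost=10.0):
        p = np.ones(shape)              # start from the uniform, type-one prior
        for mask in likely_masks:
            p[mask] *= boost            # e.g. "the target is on a table"
        return p / p.sum()              # renormalize so it still sums to one

    shape = (160, 120, 50)              # hypothetical 5 cm grid
    table_top = np.zeros(shape, dtype=bool)
    table_top[20:40, 30:60, 15:17] = True   # a slab of cubes at table height
    prior = prior_with_regions(shape, [table_top])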
So things can be complex if you have some of these a priori search cues. So let me show you some examples of all of this now, just to make it a little bit more concrete.
The simplifications that are involved in this example are that we start off with a uniform initial probability distribution function. I'll show you examples afterwards where that is changed. We limit the tilt on the cameras to 30 degrees. We have no focal length control for the Bumblebee camera, of course.
We don't have the location probabilities being viewpoint dependent, as I showed you.
We're assuming you have a recognizable face of the object somewhere from the visible free space, and there's no path planner. So the room in which this is operating is the lab here. The robot will start in this position facing forward. The target is that object back here. Okay? So basically if the robot is there, the target is behind it.
The total possible number of robot positions is 32, because we're doing this based on one meter by one meter tiles on the floor. The total possible camera directions from any position is 17, with the field of view that's specified there and the depth of field for this particular object being half a meter to two and a half meters. So from wherever the camera is, half a meter to two and a half meters is the region in which the recognizer we have can recognize this particular object. I'll show you a couple of examples of where that changes for different objects.
The total size of the action set to choose from at any step is 544 actions. So you need to choose from amongst those, and the total number of occupancy grid positions is 451,200, each encoding two values. So if you're going to ask questions like what's the total number of possible states, it's a really big number, and I didn't bother to calculate it. So it's not an easy system to deal with, given that number of states.
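A quick arithmetic check of the numbers just quoted, using the values as stated:

    positions = 32                  # 1 m x 1 m floor tiles the robot can occupy
    directions = 17                 # discretized camera directions per position
    print(positions * directions)   # 544 candidate actions at any step

    grid_cubes = 451_200            # occupancy grid cubes, two values per cube
    print(grid_cubes * 2)           # 902,400 stored values in total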
So here's the way the example will work. In this view, this little rectangle is the robot.
This is the depth of field and the direction in which it's looking. The zero here corresponds to the fact that the camera is horizontal. These little rectangles are for your use only. The robot does not know the positions of any of the objects; it only knows the positions of the exterior walls. So these are just for your use, in order to be concrete with the system. And the target is over here.
So this image shows you the probability of target presence in the horizontal plane. In other words, the probabilities of all of the cubes in the horizontal plane only, because it's too hard to show all the planes. And these are the stereo hits, where the stereo algorithm tells you there are obstacles. Black refers to: I've looked there and I've set those probabilities to zero. The gray scale here shows you the resulting probability changes. And this is just an image of where the robot actually is.
So you can see it as it's going.
So this is position one, and sensing action one. The second sensing action looks at this other angle; so you see now that the probability changes. On the third you see the background gets lighter because the probabilities get higher. And the fourth. So it does all of those, and then the algorithm decides, well, there isn't any other particular view that's going to be useful to me, in the sense that there is a threshold that has to be exceeded in terms of the utility of a particular viewing direction. So then it decides it's time to move.
So as I said, the time to move is determined by hypothesizing that I'm in a particular other position and looking at what the sum of the probabilities are at that other position that I could see. So this is the map that gives you those hypotheses as each one of
those actions that I just showed you is executed. So this is before anything is done.
This is after the first action, after the second action, after the third action, and after the fourth action.
So you see that after all of those views, those four views that it took, it's populated the world with a number of stereo hits, some correct, some incorrect. Wherever you see an X it's determined that the robot cannot get to that point, because there's not a straight-line free-floor path from its current position. And the sum of the probabilities of the sensed sphere at each position is indicated in the gray scale.
So after the four viewpoints from here, it determines that the next best position is here.
Okay? So it moves to there and starts taking its views again. So these will be in blue.
So it takes a view there. It takes a view there. This view is at a tilt of 30 degrees because, looking backwards, it has already seen the horizontal plane. So now it tilts the camera up 30 degrees to take a different view, but over that same area, over here. And then it has to move again. So after all of that process, it's decided that all of this is inaccessible, all the X parts. All of this is obstacles, so that's not useful. So of the free floor positions, that's the only one available to it; that's the strongest one. So it moves to there. It looks again out here. You see this map is now getting a little bit complex, as you were asking in terms of how it evolves. And it takes a few views, another 30-degree view because it looks back over an area that it's been already. And then it runs out of views and asks the question again: where is my next position?
So here all of these now are dark because it's already been here and seen all of this, but it hasn't seen things over here. So that's the next best position. Remember the target is down here. So it goes over to that position, chooses to look in that direction first, and then with the second view finds the object. So basically it had four positions and at most four views from each position before finding the object, with that being the resulting map of the probabilities throughout the occupancy grid at the horizontal level only.
>>: So [inaudible] the uncertainty of its current location because it's moving and it's some sort of dead reckoning.
>> John K. Tsotsos: It's all dead reckoning. It's all dead reckoning. As you'll see in the experiment afterwards, the performance is actually pretty good considering how stupid the planning and the stereo are. This isn't a great stereo algorithm either. So the performance is actually pretty good. And it probably would be perfect if we had a better planner and stereo algorithm. You'll see.
This is an example of what the stereo algorithm gives you. This is just Point Grey's stereo, straight, and there's the target. And this is sufficient to be able to do what I just showed you. So this is very poor stereo reconstruction by any stretch of the imagination, but it's sufficient to deal with the problem of active search.
Let me show you a different example. This is a smaller object. It's rotated from the position that it was learned by the system. And it's partially occluded. So it's back, back here. You'll see it a little bit later. Robot starts off initially the same. So it's a little bit more difficult.
And this time we've included here the image that the robot actually sees. This is the image that the robot has to search from every view. Everything else is the same. So this time it takes this view first. Notice how the depth of field is different than it was in the previous example. This is because the object is different. So this is computed separately for every object. The object was smaller this time. So it can't be as far away.
It takes its views from that position. So there's a wide variety of images that it has to test.
Moves to that position. The object is there, but it's too far away and too occluded and it's outside the depth of field, so it can't recognize it at that position. It looks through all of this, again at 30 degrees because it's overlapping. So again, if you look at these images, it's really like a little kid. And I've presented this to people who study developing kids and they say it looks exactly like what they do: they just sort of walk around and look for things.
So it's back there, but the robot's not pointing at it. So it's getting closer now because there's where the object is. And it's missing it from every view and then finally it sees it at a viewpoint here, but that viewpoint is too occluded. It didn't recognize it at that point because it's -- the occlusion is too great for it to reach its threshold for recognition. It moves to the next position and now it finds it because it has a viewpoint that it actually recognizes sufficiently.
So again, these are four movements, with at most four images at each one of those. So again, that's a pretty good kind of result. When we've done this live for visitors and we just let the system go until it finds it, we allow visitors to put objects wherever they want, and it's a pretty robust system because it doesn't give up and it doesn't get lost. So it really keeps going until it finds the object.
At some point dead reckoning error comes in, so it fails; that part we'll leave aside.
Yes?
>>: [inaudible] the object [inaudible] well, what if I -- if there's nothing of -- like any sort of object [inaudible] then if it doesn't have information, so what's the behavior [inaudible] in those cases?
>> John K. Tsotsos: You mean in the first shot, in the first image?
>>: [inaudible].
>> John K. Tsotsos: In both of these examples it could not see it in the first image.
>>: [inaudible] why would it be at the right decision after that? And does it have any evidence for the [inaudible] about where the next -- so the problem [inaudible] you can't see anything anyway, right? So what makes it the correct step in the first place?
>> John K. Tsotsos: So there's nothing that makes it take the correct step, whatever correct means, for the first several. It's simply exploring and gathering evidence about what exists in the world. Basically, for the first few steps it's deciding where not to look again. Okay? So that's important. The whole thing is really a process of elimination. It's really: I've looked here and I can't find it, I've looked here and I can't find it, and I'm just going to keep looking everywhere else until I find something. So it just eliminates as it goes through. So it's kind of a search pruning process as opposed to something that directs it to a particular position. Okay?
>>: You mention [inaudible] gives up, but if your exclusion map already includes the whole space, then it should give up, right?
>> John K. Tsotsos: If it actually covers everything -- and we'd have to turn off the thresholds, for example, for it to never give up as well, because right now we limit it to the point where, when it decides on a particular next position, it needs to ensure that the total sum of the probabilities in that particular sensed sphere is greater than a threshold, otherwise it won't go there. So it prioritizes. And if we turn that off, then all of a sudden all of the probabilities are in action, even if they're small.
So it doesn't give up until all of that stuff is exhausted. That's what I mean. The examples I showed you take about 15 minutes of real time. Not optimized, not a fancy computer or anything. So I'm sure this would be close to, you know, just a couple of minutes if we really chose to push on it.
There are lots of other search robots out there of a variety of kinds. There's search and rescue robots and lots of others that keep searching. None of them do exactly this because a lot of them try to find optimal paths in an unknown environment and search while looking for an optimal path, for example. Or lots of them look for shortest paths through environments and that sort of thing.
One that's interesting and that we should compare to is the work of Sebastian Thrun and his group, who have looked at POMDPs for solving a similar problem. So the natural question that you could pose to me is why aren't we using a POMDP? There are a couple of reasons. I mean, first of all, we started this before POMDPs appeared, so, you know, we wanted to see this through to the end. That's not a solid science reason, though. The solid science reason is that the space we have is much larger than what POMDPs can deal with. We can't make assumptions about the nature of the probability distributions within the system in order to simplify things.
And the number of states that we have is much larger. I showed you the numbers of actions and numbers of positions and so forth. The kinds of things that people have looked at in the POMDP world really look at small numbers of states and observations. They look at hallway navigation, and that's kind of the largest number of states that I've seen, which is orders of magnitude smaller than what we need to deal with.
So our problem is just a larger problem. Also, when you think about the solution, it's a very intuitive kind of solution as opposed to what the POMDP does, and it's intuitive in the sense that when I presented this, people have come up to me and pointed me to the [inaudible] paper that looks at ideal observers in human visual search, and showed that there seems to be a good similarity between the way that we've done it and the way that experimentalists have found people do it. So we're actually currently investigating whether or not there's a stronger relationship between our method and the way humans do it.
So we have a couple of reasons for not going the POMDP route that I think are pretty solid and interesting. We can examine different search strategies, though. I've shown you only one search strategy, namely: I look first here, I examine everything here, and then I move to another location that has high probabilities. There can be other things. So we decided to do an experiment looking at four different search strategies. This first one is the one that was present in the examples I had just shown you: explore the current position first, and then the next position maximizes detection probability.
We could choose something, an action like this, where you choose the action pan tilt XY with the largest detection probability. So every time you know we decide the best place for me to look at is over there, I'm going to move there and look at an image. And then move over there and look at an image. And then move over there and look at an image, one image, and so forth. So that's a strategy.
We could explore the current position first -- this is strategy C -- but the next position maximizes detection probability while minimizing distance to the position. So I'm going to be looking at other places to go, but I want it to be somewhere the probability is high but the distance is also close, so that I minimize the amount of travel that I have.
Or I could have strategy D here, where I've kind of relaxed the distance requirement so it's not so strict. So it's okay to be a little bit far, but it doesn't have to be, you know, the absolute minimum.
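A hedged sketch of how the four next-position strategies could be scored. Strategy C divides probability mass by travel distance, which matches the naive combination discussed later in the questions; the softer weighting used for D here is only an illustration, since its exact form isn't given in the talk:

    import math

    def score(strategy, prob_mass, distance):
        if strategy == "A":         # maximize detection probability only
            return prob_mass
        if strategy == "B":         # best single (pan, tilt, X, Y) action each time
            return prob_mass        # same score, but applied per action, not per position
        if strategy == "C":         # probability per unit of travel
            return prob_mass / max(distance, 1e-6)
        if strategy == "D":         # distance still matters, but less strictly
            return prob_mass / max(math.sqrt(distance), 1e-6)
        raise ValueError(strategy)

    candidates = [("near table", 0.20, 1.5), ("far corner", 0.35, 6.0)]
    for s in "ACD":
        best = max(candidates, key=lambda c: score(s, c[1], c[2]))
        print(s, "->", best[0])     # A prefers the far corner, C and D the near table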
So we ran an experiment using those four strategies, and we also wanted to see the effect of prior knowledge. So what we did is the following. This is our room. We had the robot start in five different positions, and we placed this object, a cup, in four different positions as well. We used SIFT; if you remember, I mentioned it didn't matter what recognizer we used here, and SIFT gave us good recognition performance because of the texture, so we used that. And sometimes we had no prior knowledge, in other words exactly like the examples I've shown you, and other times we allowed it to know that the target object is placed on a table. Not under a table, not over a table, on a table.
And then we had all those different combinations. So for each combination we ran 20 experiments; we ran the whole recognition algorithm 20 times for each combination of all of these conditions. So the total number of runs was 160. It found the object in 145 of those, and when we looked at the reasons for failure for the other 15, it was dead reckoning for one, because it would get itself lost sometimes, and unreliable stereo, because sometimes you get stereo hits where there actually are no objects, so it kind of blocks the system. So I think that if, instead of having really dumb methods for navigation and stereo, we had better ones, we would have very high performance.
And here are the results. For each one of these cases, what we did is measure the number of actions, the total time in minutes, and the total distance traveled in meters.
And these are the four search strategies that we looked at. The top one has no prior knowledge; in this one the target is on one of the tables -- it doesn't know which one, but it's on one of the tables. And if you look through this, you'll see that strategy C wins
throughout, everywhere. Namely the strategy that says I'm going to choose my next action by maximizing the probability while minimizing distance to that position.
So in all cases that wins; the best strategy is that one. And some knowledge is always better than no knowledge in this case. So now you have the ability to decide, for a particular robot task, what is important to minimize: is it time, is it the number of actions, or is it the distance? And you can choose which one of these strategies in order to minimize which of those. So in all cases, choosing strategy C plus knowledge minimizes time; for actions you have a number of different choices plus knowledge, A, C, or D; B is the worst one, so you never choose B. And for minimizing distance it's choosing strategy C with prior knowledge.
So this allows you to tailor your particular action depending on what's important to minimize. You're running out of power, you want to minimize time and/or distance, and you program it accordingly, that sort of thing.
>>: [inaudible] minimize the distance and the [inaudible] how will you combine the two?
Is there any principle [inaudible].
>> John K. Tsotsos: That.
>>: Why you use that?
>> John K. Tsotsos: Oh, why? I'm sorry?
>>: Divide by the distance, but what's the --
>> John K. Tsotsos: It's just a -- I think probably the most naive way of putting both together. I'm sure there are other ways of doing it, but that's kind of the most naive way of doing it. That's what we did for this experiment.
I think that there's -- I think there's no claim that those are the only strategies that one can use. I'm sure you can come up with other strategies. I'm sure you can formulate them differently. That's what we used and that's the result that we got.
>>: Maybe in something like a resource [inaudible] when you travel long you need a --
>> John K. Tsotsos: Oh, yeah, I think that there are lots of ways of dealing with this.
So we haven't had -- so I think that if you're going to be using this say in -- so when
Yiming Ye was doing this initially it was for a robot that would be used in a nuclear power plant. So there you have a different set of constraints in terms of where you'll find available power for recharging the robot and, you know, there's a lot of different other resources that can go in there. So these strategies would change depending on it.
For our simple case scenario, what we've done here is show how you can have different strategies and choose amongst them, how they do have a definite effect on the performance, and how it's possible to then tailor your robot, depending on what you want to optimize. That's all.
So yes?
>>: The numbers you have shown are not the average numbers. I wonder how big are the variances [inaudible].
>> John K. Tsotsos: Yes, of course it does. And actually I don't know. I think that -- so, in terms of time, I can tell you that I can't recall -- I can't recall it going for more than
25 minutes ever. For anything ever. And I can't recall it ever finding anything in under five minutes or so. I mean, so I -- so I don't think the variances, I don't think the variance in these particular examples would be so large. I think it would actually be kind of small. But I don't know the numbers. It's a good question. I just -- if Sonia were here, she could tell you.
So this is Sonia's work, and I want to thank her for doing all of it; it was she who did all the programming on the Pioneer. So let's conclude. Have we made any steps towards the dream? Remember the dream that we initially had? I think we kind of have. We have a visual object search strategy which we've shown to work, which has good performance characteristics, and which has several things that we can push on to improve it. I haven't shown you, but we have a suite of other supporting visual behaviors for the wheelchair, like opening a door, visual SLAM, obstacle detection by stereo alone, and so forth. So all of these things are part of the wheelchair system.
We're currently working on integrating all of those into the system, generalizing doorway behavior so it recognizes arbitrary doors. Monitoring the user, which is very important.
As I was explaining to Zhengyou earlier, we haven't used -- we haven't tested this on users just because we can't get ethics approval at this stage because we can't prove that it's safe for a disabled user. So we're working on monitoring the user visually to be able to detect whether or not they might be distressed in order to be able to stop the system. And thus try and achieve ethics approval to do more.
And design is something that has to come still. Because the robot, the wheelchair robot currently looks like a mess as most experimental things do. So we need to have a nice fancy design like Zhengyou has for his table top system. So that needs to be done as well.
But from the perspective of the bulk of this presentation, I think that the visual object search method that we have has been shown to be powerful enough to deal with simple situations under pretty difficult conditions, difficult here meaning that we have no path planning and we have very low quality stereo and so forth. So by improving those, by adding in the ability to learn objects and to have viewpoint generality, I think we could improve this to have far superior performance to anything existing. So if you have any further questions, I'd be happy to answer them.
>>: So the one question I have is you don't assume the knowledge of the environment but you still [inaudible].
>> John K. Tsotsos: Yes.
>>: Which is [inaudible]. I'm wondering whether there's any implication if you merely assume it's [inaudible] in your [inaudible].
>> John K. Tsotsos: Well, the implication would be, first of all, that the reason we use the box is to set the number of cubes in the occupancy grid and to be able to set the initial probabilities. So if you have no limits and you just don't know what the external boundary is, I would say that the first thing one would have to do is arbitrarily pretend I'm in a box, or in a fixed area, and I'm going to search that fixed area and set the probabilities in this fixed area. And you don't move outside that fixed area until you're certain that it's not found, and then you would move to another fixed area. So that would be one way to sort of punt on that problem.
>>: [inaudible] maximize probability, right, [inaudible]. So this assumes that you already have a particular model of probability, right? Since your active [inaudible] is based on the probability, you cannot assume that you already know, you know, that you have a good model of probability, but how about thinking of an [inaudible] that tries to maximize the peakiness of the distribution? I'm kind of trying to think of those cases when your probability distributions are not good, I mean, you're probably making incorrect moves, especially in the beginning of the [inaudible], so if you replace this by another criterion which is trying to sort of figure out where is the maximum information I can get, not just about the object, but in general about the whole probability distribution, like information [inaudible], and I mean just thinking, did you have a chance to think about this [inaudible].
>> John K. Tsotsos: Not in this context, but another one of my PhD students who is now at [inaudible] as a post-doc, looked at information maximization in visual attention.
So he has a model of saliency that is not dependent simply on combining all the features into a conspicuity map as is most common, but rather in choosing where to look, depending on exactly where you would get the most information by looking. So we've done that on the side of saliency and attention. And that's kind of a precursor to adding it to this system. So from that perspective, I agree with you, it is a reasonable thing to try. And we've looked at some of the first steps there.
>>: [inaudible] but my [inaudible] in the beginning it might give you more [inaudible] but
--
>> John K. Tsotsos: It might. On the other hand, it's hard to say, though, because you know, it's an environment where you actually -- you know nothing.
>>: It's probably [inaudible].
>> John K. Tsotsos: It's a complex enough problem that you just have to try things in order to see if it works out.
>>: So compressive sensing has been pretty hot these days. I wonder if comparing with compressive sensing [inaudible].
>> John K. Tsotsos: I'm sorry. With which?
>>: Compressive sensing.
>> John K. Tsotsos: Compressive sensing?
>>: Right. Which is none of that [inaudible] guarantee to have a [inaudible] determine your search strategy based on the previous observations. So I'm wondering if you could comment on [inaudible].
>> John K. Tsotsos: I have to admit my ignorance. I don't know what compressive sensing is, so if you tell me quickly, I'll be able to comment better.
>>: Compressive sensing is a bunch of [inaudible] in order to [inaudible] and to not have to [inaudible] and as long as you assume some sparsity on the signal, then you'll be able to reconstruct the signal based on the [inaudible].
>> John K. Tsotsos: And people do recognition on this other signal? Or do you have to reconstruct the signal and then do recognition?
>>: There's not much recognition.
>> John K. Tsotsos: So for us, the key is not the size of the signal at all, it's doing the recognition and, more importantly, how you decide where to look. That's really the question: there's a lot of computer vision work on recognition given an image, and there's lots of work on compression given an image, but here we have to decide which image you want to act on. So the bulk of our work is on deciding which image you want to do recognition on. It's not simply doing the recognition.
So from what you just described, I'm afraid I don't -- I don't have enough to be able to make a better comment. But this is really an exercise in determining how to do signal acquisition, not transmission, not the recognition, not anything. It's how do you acquire the right signal in the shortest number of steps in order to find something.
>>: And I think that probably relates to compressive sensing. So the signal here is really the probability distribution that you care about, and you go and stop [inaudible] and then you [inaudible]. Whereas in compressive sensing, you already have a signal and you [inaudible] where to send it so that you can reconstruct it. So I guess the difference is you don't even know the signal, and that's what you're referring to.
>> John K. Tsotsos: Yeah, you don't know. You don't have that distribution initially.
You have to create it by moving around. You just don't have it.
>> Zhengyou Zhang: Okay. Thank you very much.
>> John K. Tsotsos: Thank you.
[applause]