
>> John Krumm: Hi. So I'm John Krumm from Microsoft Research and this is Emma Brunskill.

She is now an NSF post-doctoral fellow at UC Berkeley. Before that she got her Ph.D. from

MIT. Before that she got her Master's at Oxford University. And before that she was at the

University of Washington.

So she's going to talk about how to interpret casually given directions to get places.

>> Emma Brunskill: Feel free to interrupt me with any questions during the talk. So, first off, I'm going to start with a video to illustrate the types of tasks.

[video]

In case you didn't catch that, at the bottom I've written out the instructions that Tom gave to the robot. So Tom told the robot to take him down the hallway past the railing to the kitchen on the right. You can see this is what this automated wheelchair is doing. It's followed those directions.

It went past the railing and now Tom is going to end up actually in the kitchen.

So this is the type of task we'd like to be able to do. We'd like people to naturally give directions and then a robot or an agent to interpret what is the physical path that the user means by those directions.

So given a sequence of verbal directions, what physical path does this correspond to? And there's a number of reasons why we think this is an interesting problem. So the first is for the example we've just seen, for automated wheelchairs. There's a number of cases where people may not have the dexterity to handle a manual wheelchair or even a wheelchair with a joystick but they might still have possession of their vocal cords.

We'd like people to naturally interact with the wheelchair, in particular over long distances: just say take me down the hall, across the street, to the Post Office. And there are other types of applications.

The second application we're interested in is home assistant robots. So I would like to be able to have a robot operate in my home and say to the robot, go down the hallway past the bedroom and get my book in the middle of the living room. And I'd like to be able to do this for a number of different items, and say it as I would to a spouse or a friend, instead of using a specific language or a specific set of words for a robot.

Similarly, we could use this for automated vehicles. I could jump into a cab and just tell the cab driver, which could be an automated agent, you know, go across town until you hit the stadium, and then once you're at the stadium I want to be dropped off at the third gate.

So all of these applications are ones that we're interested in. And, in general, we think this sort of representation is a way to try to facilitate natural human agent interaction where the agent is either a robot or other sort of autonomous agent. And also we think this sort of research can be useful for generating directions when agents are trying to give directions to humans. If a robot was trying to explain to a human how to get to somewhere, it would be useful to know what type of representations tend to be most useful for humans and think about the process from that perspective.

Now, why might this be hard? Well, first of all, there's huge variation. There's a huge variation in the number of ways you could describe a particular physical path, and there's a number of reasons for this.

So let's take this picture. So this is a picture taken from one of the spaces in the MIT computer science building. You can see a microwave. You can see a fridge. I would call this sort of a kitchen or kitchenette. And one person might say something like go past the microwave. Another person might use a higher level representation like go past the kitchen. And someone might simply say go straight, because there's nowhere to turn into this space, so they might just refer to going forward.

So you can see here there are both different levels of describing objects. And there's also the notion of geometry versus landmarks. In the example I just showed you, someone might have said just go straight versus referring to any landmarks, but in general, depending on the environment you're in, it may be more or less natural to talk about landmarks versus geometry.

So if you have a highly regular structure like this, it might be most natural to refer to going left, going right or going straight. In other cases where you have very large open spaces, with a lot of different objects, then it might be most natural to talk about landmarks like go to the information desk. Or if you're in a museum, go to the modern art section.

So in this talk I'm going to be focusing on the scenario where the person or agent that's trying to interpret the directions actually has access to a map. And so one of the first issues we have to think about is what type of map we have. How are we going to represent the environment?

Once we have that environmental representation, I'm going to focus on the main bulk of the talk which is given a set of directions how do we figure out what is the physical path associated with those directions.

Then at the end I'm going to consider two further things, first of all, how can we improve these results by asking questions and what would we do if we didn't have a map in advance?

So, first of all, I'm going to talk about how we might want to represent the world. I'm sorry that got cut off.

What we're going to do here is we're going to drive a robot around an environment, and the robot here is equipped with a SICK laser range scanner, which, for those who aren't familiar with it, gives you range information across roughly 180 degrees around the robot.

And what you can see here is that as a robot drives around an environment, it can use that range information to construct sort of a metric map of the environment. And you can think of this as being grid cells that are either occupied or not occupied.

In addition, we can take camera images to augment our representation of the environment. Now, mapping and robotics has been a really big topic. Most maps look something like the thing on the left.

So in particular this is the third floor of the MIT computer science building on the left. And that's what I would refer to as sort of an occupancy grid or metric map. Consists of a number of

different cells which are either occupied or not. And you can see a potential path that the robot could take through that environment.

But in general people don't tend to think about maps at that level of resolution. I don't tend to have a map in my head of whether each little grid cell is occupied or not. I tend to think of it at a much higher level in terms of rooms and hallways and things like that.

So in some of our prior work we thought about constructing hybrid metric topological maps. And this means that we're going to have sub maps which are small sections of the environment where we do kind of have this occupancy grid representation. But then between sub maps we're just going to have rough connections.

So I have sort of a topological representation where each of these sub maps are nodes. And in general there are a number of different ways to construct hybrid sort of metric topological maps.

We were using spectral clustering to sort of separate the environment into regions that are tightly connected versus sparsely connected.
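To make that step a little more concrete, here is a minimal sketch of partitioning a small, made-up adjacency graph into two tightly connected sub maps with spectral clustering; the affinity matrix, weights, and cluster count are illustrative assumptions, not the actual system.

```python
# Minimal sketch: partition a free-space adjacency graph into sub maps using
# spectral clustering. The graph and cluster count are made up for illustration.
import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical affinity matrix: entry (i, j) is 1 when cells i and j are
# directly reachable from each other, 0 otherwise.
affinity = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],  # cell 2 also connects to cell 3, bridging the two groups
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

clustering = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = clustering.fit_predict(affinity)
print(labels)  # e.g. [0 0 0 1 1 1]: two tightly connected sub maps, joined sparsely
```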

And we can do all this in an automatic fashion. So we can have our robot drive around the environment and then construct this representation of the environment. And then what we're going to be trying to do is think about navigating through these topological nodes. We're going to assume that we know how to get around within a topological node, but we'll be interested in which connection we should be following at that higher level.

So that means when we get a set of directions what we're going to be thinking about is a physical path where a physical path is a series of topological nodes through our hybrid metric environment.

And in particular we're going to pose this as a global inference problem. So we're going to be thinking about what's the most likely sequence of physical regions or physical topological nodes corresponding to a set of directions.

And we're going to pose this as a graphical model. In particular, we're going to phrase this as a

Markov random field. So throughout the talk I'm going to use R to describe these topological nodes which consist of these metric sub maps, and then L to refer to landmarks, which are things like railings or kitchens or fridge or a number of other types of object representations.

So given that, we can write down what is the probability of an underlying set of physical regions given a set of landmarks, and we're going to do this by thinking about two different types of potentials. So they're going to be potentials associated with landmarks and regions and then potentials that are only associated with regions.

>>: So here the Markov random field is not directed -- when you have an arrow, what does it mean?

>> Emma Brunskill: That's true. In general it's not an undirected graph. We're sort of -- I put the arrows there because I think of it in terms of the temporal sequence, but you're right that we don't actually have to represent those. In general these are --

>>: You do have the causal relationship, so you could think about it as a Bayesian network.

>> Emma Brunskill: You can -- what we originally started with was with a hidden Markov model.

I'll talk about why that doesn't end up being appropriate for this application.

>>: Your goal is to infer the sequence R 1 through RN.

>> Emma Brunskill: Exactly.

>>: So that gives you the region in the map, but it doesn't yet tell you how to navigate within that region?

>> Emma Brunskill: It doesn't tell you how to navigate within the sub map regions, but the assumption is that if you're a robot doing something like line following or something like that, you already have primitive motor actions to get you in between connected topological pieces.

>>: Okay.

>> Emma Brunskill: So as I just said, the process is that we're going to try to find the sequence of physical regions that maximizes this likelihood score. And in particular that's going to mean that we need to have ways to represent the potentials of the connection between regions and landmarks and the potentials across regions.
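To make the objective concrete, the factorization being described could be written roughly as follows; the notation here (phi for the landmark-region potentials, psi for the region-region potentials) is assumed rather than taken from the slides.

```latex
% Assumed notation: R_i are topological regions, L_i are landmark observations,
% \phi is the landmark-region potential, \psi is the region-region potential.
P(R_{1:N} \mid L_{1:N}) \;\propto\; \prod_{i=1}^{N} \phi(L_i, R_i) \prod_{i=2}^{N} \psi(R_{i-1}, R_i)
```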

So how might we do this? Well, let's think about our map. What we'd like to be able to do is say if someone talks about a conference room we'd like to think about what's the probability or what's the potential that region R 1 is the region that they're referring to when they say conference room.

And this is where the variation gets slightly tricky. So someone could have said conference room, or they could have said go past the projector, or they could have said there will be a room with a lot of chairs. There's really in general a huge number of ways you could describe a single region.

So one possibility is that we could sit down and we could train a huge number of different object detectors and we could run those object detectors on our environment and we could collect a very large set.

We could do this for all possible landmarks that we think someone might mention. But there are two problems with this. One, that's a really costly process. It typically takes a while to train some of these object detectors, and while some of them are available, there are a lot of object classes that do not have existing object detectors you could just download off the Web. The second is even if we did this for an extremely large class, say 100 or a thousand different landmarks, it's still fairly brittle, because someone could come along and use a slightly different word that wasn't in our set and we wouldn't have good observation potentials for this new landmark.

So we're going to take a different approach. We're going to do something I refer to as detect and bootstrap. The idea is we're going to build up a few object detectors and then we're going to run that in our environment and that's going to give us a small set of object classifications or object detections.

But then what we're going to do is leverage contextual relationships in an existing database in order to essentially infer landmark probabilities for a huge number of other landmarks.

So for the first part of this we're just going to train up a few object detectors. And so, for example, here we've got railings. We've got doors. We used the Felzenszwalb algorithm, but there are a lot of existing object detection algorithms you can use. The key part is we have to label images by hand, train up these object detectors, and run them over our environment.

Now, what do I mean by contextual relationships? So this is probably looking fairly familiar given the space we're in right now. So you can imagine that maybe you're in a space where there are chairs. There might be a couple doors. There might be a screen somewhere. And there's a clock.

And in fact often we see these types of relationships in conference rooms, just like we're in right now. So in general you wouldn't be surprised if someone referred to any of these different landmarks for this particular environment, if they called it a conference room, if they said there's a screen, if they said there's chairs, or a door. And we expect that a lot of these items tend to be co-located in this sort of environment.

So where might we come up with a place that already has these types of contextual relationships preencoded? Well, the nice thing is there's an extremely large database called Flickr which is publicly available where people upload pictures and they tag them. For example, in this image they've already tagged it could be a conference room and there are chairs and screens and things like that.

If we look at downloading a bunch of images from Flickr and look at what the tags and co-occurrences are, they're sort of what you might naturally expect. So if we look at desk, the most common tag that co-occurs with desk is office.

We also see things like computer and Mac and work and all these things could be useful in terms of trying to describe spatial regions, and they're sort of different ways that people might describe the same spatial region, because that's exactly what Flickr does.

>>: So to make sure I'm understanding, the words that co-occurred -- those are from the same physical images?

>> Emma Brunskill: Yes. For all of our desk images -- for images that had tags of desk -- what are the other tags that those images have.

>>: So that means let's say office co-occurs.

>> Emma Brunskill: Huge number of times. So the images we downloaded, the most frequent other tag that was co-occurring with desk was office.

>>: That's within the same image.

>> Emma Brunskill: Within the same image, yeah, exactly. These are all tags for the same image from a very large set of images we downloaded from Flickr.

>>: So that's a pretty good list. I was trying to see if there were any verbs or adjectives in there besides objects. I see a few. There's black. And there's work. So you didn't do any kind of filtering to get a nice set of --

>> Emma Brunskill: No, we didn't. So, I mean you can imagine during filtering on top of this.

But, no, there's some stuff lower down that I wouldn't have thought of as much. It sort of makes sense, but things like wood and university -- I wouldn't have thought of those as the first words I would think of for spatial relationships and desks.

But because we're only going to end up using words that people actually use in their direction set and using those co-occurrences, it's not much of a problem if there are words in the database that people don't end up using much. We're also doing fairly simple parsing right now, so we're only sort of looking for nouns and a couple different types of verbs.

If we were doing more sophisticated verbal analysis, then it's possible you'd get confusions if you had a lot of other verbs that were co-occurring.

So how do we use this? In order to compute these observation factors or observation potentials, what we first do is we take our existing map and then we run our object detectors we pretrained on a small set of object classes.

For example, in this case we would have had to have trained a screen classifier, a chair classifier, a door classifier and a clock classifier. And in this environment, this particular region has all four of those.

And then for a particular landmark, say L-K, then the potential is basically a combination of considering all the different objects that are detected in that region and that landmark.

So we're sort of using almost a naive Bayes approach here, just taking the product of these things individually. And then the way we compute an individual potential is just by looking at co-occurrence data. So we simply look at how often a screen co-occurred in the same image as conference room in our Flickr dataset. The nice thing is we can do this for a very large number of different landmarks -- roughly on the order of 20 to 25,000.
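A rough sketch of that kind of computation follows, with made-up co-occurrence counts, a hypothetical observation_potential function, and simple additive smoothing that the talk doesn't specify.

```python
# Sketch: score how well a landmark word matches a region, given the objects
# detected in that region, using Flickr tag co-occurrence counts.
# All counts below are invented for illustration.
from collections import Counter

# Hypothetical co-occurrence counts mined from Flickr:
# cooccur[(a, b)] = number of images tagged with both a and b.
cooccur = Counter({
    ("conference room", "screen"): 420,
    ("conference room", "chair"): 610,
    ("conference room", "door"): 180,
    ("conference room", "clock"): 95,
    ("kitchen", "screen"): 12,
    ("kitchen", "chair"): 240,
    ("kitchen", "door"): 150,
    ("kitchen", "clock"): 60,
})
tag_count = Counter({"screen": 9000, "chair": 25000, "door": 14000, "clock": 5000})

def observation_potential(landmark, detected_objects, alpha=1.0):
    """Naive-Bayes-style product of per-object co-occurrence scores."""
    score = 1.0
    for obj in detected_objects:
        joint = cooccur[(landmark, obj)] + alpha  # additive smoothing
        score *= joint / (tag_count[obj] + alpha)
    return score

detections_in_region = ["screen", "chair", "door", "clock"]
print(observation_potential("conference room", detections_in_region))
print(observation_potential("kitchen", detections_in_region))  # much lower score
```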

So that's how we do the observation computation. We compute the observation potentials. The other thing that we want to think about is the sort of potentials that have to do only with the physical regions themselves.

So remember that I said earlier that what we've got here is a topological map. So here I put red dots in sort of in the center of each region that would be sort of a topological node in this environment.

And what we're going to represent the potentials as is they're going to be one if a region is connected to another region. So if it's a neighbor. It's also one if it's itself. So you could have self-transitions.

It's zero if there's a non-self loop -- those are unallowed -- and zero otherwise, to prevent teleporting: you can't teleport from here to here, at least not yet.
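A minimal sketch of that transition potential, on a made-up adjacency structure:

```python
# Sketch of the unnormalized transition potential just described: 1 for staying
# in the same region or moving to a connected neighbor, 0 otherwise.
# The adjacency map here is a toy example, not the real environment.
adjacency = {
    "r1": {"r2"},
    "r2": {"r1", "r3", "r4"},
    "r3": {"r2"},
    "r4": {"r2"},
}

def transition_potential(r_from, r_to):
    if r_to == r_from or r_to in adjacency[r_from]:
        return 1.0
    return 0.0

print(transition_potential("r1", "r2"))  # 1.0 (neighbors)
print(transition_potential("r1", "r1"))  # 1.0 (self-transition)
print(transition_potential("r1", "r3"))  # 0.0 (not directly connected)
```

Note that these are potentials, not probabilities: they are deliberately left unnormalized, which is the point of the Markov random field discussion that follows.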

Now, this is where you might be wondering or you might have been wondering before and you might wonder again now why are we using a Markov random field instead of an HMM or other type of representation? So this has to do with the connectivity of the environment.

So let's consider this particular topological map. So the point up above at the top is connected to five other regions. This point down here is only connected to two other regions, and in fact the point over on the far left is actually connected to only one other region.

So we have different types of connectivity within the environment. Now, why is that important?

Well, recall from the previous slide that what we're doing is we just have sort of a uniform potential of transitioning to any other node that you're connected to. So on the previous slide here you just have a uniform potential for any other immediate neighbor or for transitioning to yourself.

Now, if we're in a hidden Markov framework then you normalize the transition probabilities, which means that if you're in a highly connected region, then the probability of transitioning to any other region is lower than it is in low connected regions.

So, for example, in this case if we ignore self-transitions for a second, then there's a 20 percent probability of this transitioning to any other node, whereas here there's a 50 percent transition probability.

And this ends up biasing your inference. So if you use a hidden Markov approach where you have to normalize these, then you effectively penalize paths that go through highly connected regions. In general we don't want to have a bias either way. We don't want to bias our most likely paths to go through highly connected regions or low connected regions.

If we instead use this Markov random field approach we can have flat potentials and we don't have to normalize them for an individual transition. That's why we made that choice.

Now I've described to you how we could compute these observation potentials and these transition potentials, and once we have both of these, then we can use pretty standard algorithms to actually do this most likely physical path computation.

So we're going to use a variant of the Viterbi algorithm, which, for those who aren't familiar with it, is a popular algorithm that was invented in the late '60s and has been used a lot in speech processing whenever you have a sequence of hidden states and a set of observations.

And now one thing I should mention here is that to start we're assuming that we have a one-to-one correspondence, which means we're assuming someone is going to mention every single region they pass through.

We'll consider relaxing that later but that's the initial assumption. And then we're going to figure out the most likely sequence of physical regions that are associated with that direction set.
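A minimal sketch of a Viterbi-style max-product search over these unnormalized potentials, assuming the one-to-one case (one mentioned landmark per region). Here obs_potential(landmark, region) and trans_potential(r_from, r_to) are placeholders standing in for the potentials sketched earlier, and the function name is hypothetical.

```python
# Sketch: max-product (Viterbi-style) inference over unnormalized potentials.
def most_likely_path(landmarks, regions, obs_potential, trans_potential, start):
    # score[r]: best unnormalized score of any region sequence ending in r
    score = {r: (obs_potential(landmarks[0], r) if r == start else 0.0)
             for r in regions}
    backptrs = []
    for landmark in landmarks[1:]:
        new_score, back = {}, {}
        for r in regions:
            best_prev = max(regions,
                            key=lambda rp: score[rp] * trans_potential(rp, r))
            new_score[r] = (score[best_prev] * trans_potential(best_prev, r)
                            * obs_potential(landmark, r))
            back[r] = best_prev
        backptrs.append(back)
        score = new_score
    # Trace back from the best final region to recover the region sequence.
    path = [max(score, key=score.get)]
    for back in reversed(backptrs):
        path.append(back[path[-1]])
    return list(reversed(path))
```

Because the potentials stay unnormalized, a step into a highly connected region costs the same as a step into a dead end, which is exactly the bias the Markov random field formulation is meant to avoid.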

So if we think about the complete process, what we're first going to do is we're going to have our robot traverse the environment. And when it does this, it's going to build up sort of metric map of the environment and it's going to have a lot of different associated images.

So we're going to take a large number of pictures as we traverse this environment. Then once we have that original data, we're going to compute a map representation. So we're going to compute a hybrid metric topological map, and then we're also going to run an object detector for a small set of object classes over that environment.

So here we've, say, detected chairs, and we're going to compute all the relevant potentials for a number of landmarks. Now what we do initially here is we compute potentials for I think roughly

25,000 landmarks. But you can always add additional landmarks later if anything appears in your actual direction set that you didn't see originally, and those are very easy to compute, because you don't have to train new object detectors.

Then what we're going to do is actually collect directions from people, and then we're going to try to parse those directions and infer the path that person meant.

So what sort of dataset are we looking at here? This is the third floor of the computer science building at MIT. You can see it's sort of an interesting building. It was designed by Frank Gehry.

It can be sometimes confusing to look at these maps if you haven't seen these before. This is from a laser range scanner. You can see sometimes that the laser range scanner sort of bottoms out. It has a fixed range it can get to. You can see these sweeping things if it doesn't hit that range.

In general, one of the interesting things about this environment is that it's not particularly rectilinear. There's a lot of open spaces and sort of these interesting spaces up here.

This is the environment that we asked people to actually give directions within. We also have vision, odometry, and lidar, and then a small set of detections for object classes.

>>: So how do the directions that humans give go into this Markov random field? Are they treated as observations?

>> Emma Brunskill: Treated as observations.

>>: The word sequence observation.

>> Emma Brunskill: The word sequence we parse it to extract nouns and a few other keywords and then those become the observations.

So we take this dataset and just to give you a sense, we're running a few different object detectors on here. And so in this case we have monitors. And you can see that monitors only appear in a subsection of the regions.

And these object detections are what we're using to bootstrap the Flickr learning to compute all those observation potentials. So monitors, chairs and a couple other object classes are what we're using when we're trying to compute all of the probabilities that other words, say computer or keyboard, are going to be co-located with these detections that we do have, in case someone refers to those.

>>: It seems like at this point you'd be in a great position to automatically label the rooms, too.

>> Emma Brunskill: You mean in terms of type and stuff, like kitchen --

>>: Right.

>> Emma Brunskill: Absolutely. Yeah. And one of the things that my colleague, Tom Kollar, is working on, too, is using this to figure out the type of the room and what objects are co-located with it, which means if someone says something like go get me a coffee cup, it could figure out where the space is where that's most likely to be, which is kind of nice.

So what was the experiment we did? So we took this area, and then we segmented it. There are roughly 24 regions here. And then we gave this map to two subjects and said: go explore the environment and then give people directions between a starting location and an ending location among these regions, but you can't use these region labels to refer to the environment. Just do it as if someone came up and asked you a question about getting to this particular point and how you would give them directions.

And then we got other people to try to follow those directions in this environment without these labels. So as I said before we have about 25,000 landmarks. We used WordNet to sort of extract different levels of the types of landmarks people might use for spatial information. So you can go sort of up or down in granularity, from kitchen down to spoon and things like that.
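As a small illustration of the kind of granularity walk WordNet supports (this uses NLTK's WordNet interface; the particular synsets shown are just examples, not the landmark set from the talk):

```python
# Sketch: walking up and down WordNet's hierarchy to relate landmark words at
# different levels of granularity. Requires: pip install nltk, then running
# nltk.download('wordnet') once before use.
from nltk.corpus import wordnet as wn

spoon = wn.synsets("spoon")[0]
print(spoon.hypernyms())                    # more general categories for "spoon"
print(wn.synsets("kitchen")[0].hyponyms())  # more specific terms under "kitchen"
```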

But these are some of the images from the environment. Just to give you a sense of the number of different ways you could imagine referring to this. So depending on what is salient to you, you might pick up on the fact that there's a bike here. You might refer to this as an office. This might be a fridge or a kitchen. There's monitors.

So there's really a very large number. We didn't ever encounter words that weren't in our initial database in the directions but it would be very easy to add that on top because of the very large nature of the Flickr dataset and the tags available.

So what's an example subject direction? So one example would be head down the hallway with the open area on your left and the railing on the left. At the end of the hallway take a left head through the open area where the computer's on your right and head into the conference room across the bridge to your right. That would be an example direction.

>>: So these are all highlights? [phonetic].

>> Emma Brunskill: Originally we were doing parsing by hand for the first set of data. The second set of data we're doing automatic parsing but doing nouns and a couple of known verbs.

>>: Do you preserve the sequence -- as in which comes first -- or do you put everything through as a bag of words?

>> Emma Brunskill: No it's in sequence. So we assume -- so we preserve the order but we're still throwing out a lot of other information that you could imagine getting just from the verbal parse.

Now one thing we saw, which was expected, is that there's a lot of variation in how people give directions. The next three examples are for the same physical path, and yet people use very different ways to describe it.

For example, here people say proceed past the kitchen and walk through the pair of doors that are on your left. So this person is picking up on the fact that for them a kitchen is a salient feature. And that's something someone's going to pass through.

The next person didn't refer to any landmarks in this part. They said go through the doors and all the way down the hallway. For them geometry was sufficient: go straight, you don't have to make a decision for a while, it doesn't matter that there's a kitchen near there, I'm not going to bother to represent that part.

Now, another person actually referred to sort of specific objects like copy machines and a restroom. Now this is interesting because a restroom to me is on the same sort of semantic generality as a kitchen. But yet they didn't bother to mention the kitchen because the kitchen wasn't relevant for them.

So in these cases you can see that people are using very different types of representations to represent the same physical path. So these are initial results. The first thing to note is that this is a fairly challenging task.

So for any of you who have ever been to a Gehry building, Gehry buildings tend to be quite unusual and very tricky. I always found that when we were trying to order pizza or something we'd normally just go get the delivery person, because it was much too hard to give directions.

People could follow these directions correctly about 55 out of 80 times. But one thing that was encouraging to us is that our Markov random field approach got 47 of those correct. So it's a challenging task, but we're doing pretty well relative to humans on this set.

Now, another thing you can imagine doing here is just guessing which spot you're going to end up in, and that would be only about five correct. So it's much more -- we're doing much better than chance guessing of where you should end up.

The next thing we see here is that a Markov random field really is a better approach than the hidden Markov model in this case because of the normalization I was discussing earlier. This becomes a particularly big issue the longer your path length.

So if your paths are only one or two regions long, you've got less of a likelihood of actually having to traverse through one of these highly connected regions, but as your paths get longer it's more likely that you're going to go through one or two highly connected regions, which severely penalizes the likelihood in the hidden Markov model case.

So as your paths continue to get longer and longer, the hidden Markov model will try to avoid these highly connected regions more and more, whereas the others, both people and the Markov random field, aren't penalizing highly connected regions.

So we're fairly happy with this first environment, but it's still not that large of an environment. We wanted to try bigger examples. So the second environment we looked at was a multi-building environment. And on the MIT campus a lot of the buildings are connected. So even though this is all a single floor, it represents a number of different buildings.

And this is an interesting environment for a couple of reasons: You can see already that it has a very different structure than the last picture I showed you. So there's a lot more rectilinear structure and it's larger than before.

>>: So for humans -- you throw away a lot of words, for example words such as turn right versus on your right. Now in your observations both of those are 'right.' You have 'turn.' So can you throw out all this information?

>> Emma Brunskill: That's a great point. So that's going to be one of the challenges I'll show here; then we incorporate turning right or turning left. And turning right or turning left ends up becoming part of the region transition probabilities instead of part of the observations.

>>: So you need to make sure you understand the natural language.

>> Emma Brunskill: Exactly.

>>: So what you're doing now you throw away a lot of --

>> Emma Brunskill: We throw away too much. I'm going to talk about the challenges that come up with the new dataset, and some of those are with the parsing and getting a better sense of the natural language.

>>: Can it almost do as well as a human?

>> Emma Brunskill: No, it is. Well, I'll show you in a second that that's not the case in this environment.

So 15 subjects gave directions between ten pairs of locations in this environment. And then we had 15 other subjects try to follow those directions. And the interesting thing for us here is that people did very well. They got 85 percent correct. So this was a very easy environment for people to navigate in. And our approach got eight percent correct.

We did much worse than before and people did much better. We looked into why this is. What is it about this new environment, and what are the new challenges? So the first challenge is that people don't describe all of the regions that they go through in the environment. So often people say things like, starting in this hall, walk straight until -- and that's a pretty natural thing to do, because in a lot of these places there's really no choice. You can't do anything except either stop or go forward.

So people don't view it as necessary to describe all the things you're going through. In addition, some of these places aren't particularly visually salient. It can be a long corridor with a number of offices on either side, but there's not a lot that you might expect people to bother to refer to.

That's one challenge. The second thing is people use a lot of geometry. They use things like turn right into the hall. Go straight until the hall Ts. They occasionally refer to, or don't refer to, corners, which often correlate with changes in specific regions but may or may not be places where people can make a choice. So geometry becomes more important in this environment.

Another interesting thing is that viewpoint is much more important. Because you're going to be doing a lot of these sorts of rights and lefts without a lot of salient information, a lot of it depends on your orientation.

So right and left obviously depends on which way you're facing. If you don't incorporate that information you often do much more poorly.

And the final thing that I find particularly interesting is the use of sort of negative information, or side information. So people often say things like right before the glass doors, turn. Or if you reach the kitchen you've gone too far.

One of the interesting things to me about that is in a way you can think of those as virtual loops.

That is, it's either an actual loop that people are sort of making in their directions or a virtual loop that you could be making. And even though that makes the path longer, if it's a very visually salient thing, it can give you so much localization information that it's worth it to have either this virtual loop or to actually go further, because then you completely collapse your uncertainty about where you are.

So how did we start tackling some of these challenges? The first thing is that we automated the parsing -- before, we were doing this by hand, and that's not tractable when you start trying to do this for much larger environments and more people. And the second is that we're more systematically including right and left directives, because that's something that people use very commonly and it's quite important.

The next thing we did was we added viewpoint to the hidden state. And we're using four different orientations right now. And the way we do this is that we search over viewpoints -- we always know which region you start in, and we search over which viewpoint you might start in -- and then essentially, instead of only trying to infer the physical region, we're trying to infer the physical region and the orientation as you traverse through this environment.
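A rough sketch of what adding orientation to the hidden state could look like; the four headings and the way a right or left directive constrains transitions are simplified assumptions, not the exact model.

```python
# Sketch: augment the hidden state to (region, heading) so that "turn right" /
# "turn left" directives reshape the transition potential. Deliberately
# simplified; it ignores the geometric layout of the regions themselves.
HEADINGS = ["north", "east", "south", "west"]  # four discrete orientations

def turn(heading, directive):
    i = HEADINGS.index(heading)
    if directive == "right":
        return HEADINGS[(i + 1) % 4]
    if directive == "left":
        return HEADINGS[(i - 1) % 4]
    return heading  # "straight" or no directive: keep heading

def augmented_transition_potential(state_from, state_to, directive,
                                   region_transition_potential):
    """state = (region, heading); reuse the region-level potential and require
    the heading to change consistently with the spoken directive."""
    (r_from, h_from), (r_to, h_to) = state_from, state_to
    if h_to != turn(h_from, directive):
        return 0.0
    return region_transition_potential(r_from, r_to)

print(turn("north", "right"))  # east
```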

And then the other thing we tried to start handling was skipped regions. So in a lot of cases, as I was talking about before, there aren't a lot of choices, and people will skip descriptions. So you can imagine something like this: someone might say go through a metal door, turn right, and then turn right again and you'll reach a room with a bench. And that might ignore all of this space, because there's really nothing else you can do and they usually assume you're going to go forward. That means we've got a huge number of regions where there's no associated observation.

And so as an initial approach to this we allow there to be a fixed number of skips. So we say that people can have a few missing observations -- you can still have passed through a region without describing it.
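One simple way to render a fixed number of skips is to let the transition potential accept a bounded number of unmentioned intermediate regions; the chaining below is an assumption about the mechanism, not the actual implementation.

```python
# Sketch: allow up to `max_skips` unmentioned regions between two described
# regions by chaining the one-step transition potential.
def skip_transition_potential(r_from, r_to, regions, transition_potential,
                              max_skips=2):
    """1.0 if r_to is reachable from r_from with at most `max_skips`
    intermediate (unmentioned) regions, else 0.0."""
    frontier = {r_from}
    for _ in range(max_skips + 1):  # max_skips intermediates plus the final hop
        frontier = {r2 for r1 in frontier for r2 in regions
                    if r2 == r1 or transition_potential(r1, r2) > 0}
        if r_to in frontier:
            return 1.0
    return 0.0
```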

And all these things help: they get us up to about 30 percent. So it's promising that we can start to do better in this environment, but there's still a reasonable amount of work to go.

So what are some of the things that we think are important? We think we need even better alignment models, so we need to handle skips more efficiently. People in the natural language community and the speech community have thought a lot about this sort of problem -- how do you align the observations to a set of hidden states -- so we're going to leverage more of their work. And then we also want to think more about descriptions being out of order. Sometimes people will have sort of parenthetical descriptions, or later in the sentence refer back to something they described before.

And if you just assume that everything is done in the order that someone should see them, then those things can mess up your directions.

The second is a better use of geometry. So there's a number of different types of geometrical things you can imagine referring to. There's left, there's right, there's straight, there's going around corners.

And this ends up being particularly important in these types of environments. And then also including things like negative information or side information.

So I talked about negative information before like these sort of self-loops you might do if you say things like if you reach this you're going too far. But people also use a lot of side information, things like on your left will be a kitchen. And so that means that it's not actually an observation associated with the region you're in right now. It means that it's just something that's viewable.

And so that's another way we'd like to incorporate viewpoint, is that if we know what the viewpoint is then we can use neighboring regions around your current region and think about any observations that might be visible from that section. And that should also improve our observation potentials and the likelihood of the path we infer.

And one thing I was interested to see recently, when I was looking on Bing to get directions, is that it does include negative information, but sort of at a high level: if you see this street corner, you've gone too far. And they do include some side information, like you'll pass a Pizza Hut. But it's fairly restricted. There's a lot of room for visually salient things, like you'll go past the car wash with the pink elephant -- things that people pick up on when they're giving directions.

Now everything I've talked about so far assumes that someone's written down a set of directions for you and raced off and left them on your desk, and you have to go follow them.

In a lot of cases, when I go into a new office building or something like that, I'll ask someone for directions. I'll write them down and then I'll say, do I go past the gym, do I see the information desk on my way? And I have a little bit of opportunity to ask a couple of questions before the other person's busy or before I need to go try and follow them.

>>: I would guess that the directions you get can actually be followed to get you there -- maybe it's the context. Have you tried the general public, where things can get so vague that nobody can follow the directions?

>> Emma Brunskill: I think at MIT they're pretty vague too. It's pretty bad. Just this morning I was trying to get here, asking someone for directions. What I personally try to do normally is I can't store them all in my head. After about three I just ask again. So I'm always doing greedy direction following and I only go a couple in advance before I have to ask.

>>: So for this robotic [inaudible] you mentioned earlier -- once you get the directions, the Markov random field infers what regions you will have, so how do you schedule the robot here?

>> Emma Brunskill: So then what happens is the robot -- I don't know if you saw the wheelchair video at the beginning, but then we have an automated --

>>: Robot, similar kind of --

>> Emma Brunskill: We have an automatic wheelchair which we have a nice demo where it interprets the directions and then it just takes the person there.

>>: Oh.

>> Emma Brunskill: Once it knows what the physical path is. Maybe that's not your question.

>>: The question is, the Markov random field just gives you the sequence of regions.

>> Emma Brunskill: How do you do within the sub map?

>>: Yes, given that inference, how do you actually navigate it?

>> Emma Brunskill: So our assumption is that you already have in place primitive motor control to get you between the sub maps. So let's say this is a sub map and the hallway is a sub map; you could tell the robot, you know, transition out of this sub map to the hallway, and it would already know how to do that. So for something like a Roomba that would be hard, but for something like an automatic wheelchair, that would be pretty easy.

>>: Camera?

>> Emma Brunskill: Our wheelchair has a camera and a range finder.

>>: Higher level control.

>> Emma Brunskill: Exactly. This is at a higher level control. This gets me to another part of this talk, which is how can we improve these results by asking questions? So the idea is just that this is an iterative process.

>>: You mentioned -- you talked about having and not having a map. It's been estimated that one half of [inaudible] can't use a map. And again, MIT is a biased sample -- do you remember that?

>> Emma Brunskill: Like they would find it hard to follow a map? Right. So we're not trying to provide directions for people yet, but that's definitely an important issue. Even if you don't necessarily know how to read a map, I think people still navigate off of directions, off of landmarks, a lot, and use geometry. But I'll talk a little bit about that at the end, and also about what you do if you don't have a map.

So here we're imagining we get to ask a couple of questions and that sort of gives us a new distribution over what we think our paths are. So we're going to start with a pool of potential paths.

For example, what Viterbi normally does is maintain the single most likely path, or output the single most likely path. Now we're going to output, say, the top 15 or 25 most likely paths, and that will give us a pool of potential physical paths through the environment.

Then what we're going to do we're going to assume that only one path is correct and there's a probability associated with each path of actually generating the observed directions. So maybe a priori, this is the most likely path, and then these other ones are a little bit less likely.

So what we can do is, given those probabilities, we can compute the initial entropy over the paths, and we can determine whether it already seems like there's one single path that's much better than everything else, or whether it's kind of middling and we're not sure.

What we can do we can think about selecting the question that will reduce the expected entropy the most. We don't know what the answer is. We don't know what somebody is going to tell us, but we can compute under each possible answer what our new distribution of paths would be.

And we're going to have questions like: do you see object X along the path? So, for example, in these sorts of environments it might be something like, do you go past the linguistics office, or will you go past the gym. In general, object X can be any type of landmark.

And then we're going to assume that people answer truthfully. Now in practice this might be wrong, even if people aren't trying to be adversarial, because people forget certain items along the way, or maybe they're like, oh, I think you go by that, and once you leave they immediately realize, no, you don't actually go by the gym. In general we expect this will probably have some noise, but to start we assume people answer these questions truthfully.
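A minimal sketch of picking the question that most reduces the expected entropy over a pool of candidate paths, assuming truthful yes/no answers; the path pool, probabilities, and the contains() relation are toy assumptions.

```python
# Sketch: choose the "do you see object X along the path?" question that most
# reduces the expected entropy of a distribution over candidate paths.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def expected_entropy_after_question(paths, probs, obj, contains):
    """contains(path, obj): True if obj would be seen along that path."""
    result = 0.0
    for answer in (True, False):
        consistent = [p for path, p in zip(paths, probs)
                      if contains(path, obj) == answer]
        p_answer = sum(consistent)
        if p_answer == 0:
            continue
        posterior = [p / p_answer for p in consistent]
        result += p_answer * entropy(posterior)
    return result

def best_question(paths, probs, candidate_objects, contains):
    return min(candidate_objects,
               key=lambda obj: expected_entropy_after_question(
                   paths, probs, obj, contains))

# Toy usage: three candidate paths and two possible questions.
paths = [("r1", "r2", "r5"), ("r1", "r3", "r5"), ("r1", "r4", "r6")]
probs = [0.5, 0.3, 0.2]
landmarks_by_region = {"r2": {"gym"}, "r3": {"gym", "kitchen"}, "r4": {"kitchen"}}
contains = lambda path, obj: any(obj in landmarks_by_region.get(r, set())
                                 for r in path)
print(best_question(paths, probs, ["gym", "kitchen"], contains))  # "kitchen"
```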

And we're going to evaluate this by looking at what's the destination of the most likely path after each question compared to if we don't get to ask any questions, and also at human level performance. Now we're not allowing humans to ask questions in this case. But what we're trying to do is get a sanity check of can we do a lot better or can we at least do somewhat better if we get to view this as an interactive process and ask questions compared to if you just have a static situation where you get a set of directions and have to infer the physical path.

And the answer is that at least in our initial experiments it certainly looks like asking questions helps. So in this environment people got 55 of the destinations correct to start. If we don't get to ask any questions, our approach gets 52. But if we get to ask one question, we're up to 57. And if we get to ask two questions we're up to 61.

And this is statistically significantly better than asking no questions by a chi-squared test. And it certainly seems that the trend is that it's better than people, and it's certainly comparable to people.

So this seems to suggest that --

>>: The questions asked.

>> Emma Brunskill: They don't have any questions asked but we wanted to see compared to the baseline of people does question/answering seem to help.

>>: Question, how does a Markov field reduce -- ask question --

>> Emma Brunskill: It's effectively giving you another observation. So it's effectively giving you an additional observation that's going to constrain. You can compute for any question you ask,

say is there a gym, what's the likelihood of seeing a gym along a particular path. It changes the probability distribution.

>>: [inaudible].

>> Emma Brunskill: It's not fair in terms of the people but what we're trying to see is just in general does it seem to make a big difference compared to asking no questions, and is the level of performance we're getting better than just a single person who doesn't get to ask a question.

>>: So who chooses the question to ask?

>> Emma Brunskill: The machine did. The machine automatically computed which is the right question to ask.

>>: So it computes the prompt.

>> Emma Brunskill: It computes the right question to ask to in expectation get the biggest gain.

>>: It can --

>> Emma Brunskill: Yes.

>>: [indiscernible].

>> Emma Brunskill: No, and we're doing this in simulation right now. So this is not a human user experiment. This is: if you could query about any objects in the environment, what would we expect to be the best. And so this is ongoing work, and there are a couple of things to do with the question asking that we currently want to investigate.

So the first is we want to consider the cost of asking questions. Right now we just looked at asking one or two or three questions, but in general you'd like to figure out automatically when to stop asking questions. It might be that after a single question the answer they gave is only consistent with one path and you're done. There's no need to ask any more questions.

The other is that there's some cost to asking questions. There's the computational cost for the computer and there's an annoyance factor for the person you're asking questions of. So we'd like to put this within a decision theoretic framework so that we could figure out you know when is it worth it to ask questions and what's the expected value.

Also you can imagine if you're running late it's really important to start to follow the directions as soon as possible. So we'd like to sort of use decision theory to figure out when should you stop asking questions and when is it valuable. And then the other thing that's important that gets at your question is conducting human user studies.

So the first reason for that is there may be quite a lot of noise in people's answers; if there's a large amount of noise, asking questions might be less useful. We'd like to evaluate that. We'd also like to see what questions people ask if we allow them to ask questions, in case they're very different types of questions.

We'd like in general to see whether the noise is sufficiently low that asking questions seems to help in a live environment, either for people or for our automated system.

So something we're just starting to work on now a little bit is starting to think about what if we don't have a map. So in a lot of cases I think it's reasonable to assume that you have a map, that the person that's trying to interpret the directions has a map. So if you're a tourist in a new city, if you're in the middle of a new museum, you can have a new environment but have a map of that environment, if someone's giving me directions within the place I work.

I basically have an internal map and I'm just trying to do interpretation of the directions. So there are a number of cases where that's reasonable. And then there's a number of cases where it's not.

So certainly in cases like search and rescue you would expect that the topology of the environment has changed sufficiently that you're unlikely to have a map of it, but you'd still like to be able to follow directions that say go around the collapsed building and find the crane.

Another situation is when you have limited time. So, for example -- I'm sure nobody else here has this problem -- but sometimes I'm running late. When I'm running late trying to get to an appointment, I ask someone rather than going onto the computer to find the floor plan. I say, how do I get to office 317? So in both of these cases I might know a little bit of information. You might see the immediate area around you, but you won't have a complete map of the environment.

And so what we're thinking about now is incorporating our approach when you only have partial maps of the environment.

What we're thinking is you would have a map with a number of frontiers beyond which you don't have any information. The reason this makes the problem more challenging is that you have to divide your verbal direction set into parts that refer to the environment that you know about and parts that refer to hidden parts of the environment you haven't seen yet. So there's this sort of cut point inference you need to do as well.

The idea to start with this is to infer the cut point and follow the directions until you reach that frontier and then explore a little bit in that frontier and you'll see more of the environment and then repeat this process. So it wouldn't be a complete global inference. It would be a sequence of greedy local inferences in order to follow map directions.

Now people have certainly thought about giving directions before. Perhaps the most similar work was by Matt MacMahon, who did his Ph.D. thesis at UT Austin. He thought about direction giving, but mostly in virtual environments where you don't have the same type of spatial environments as we do.

He had these virtual environments where he set up a number of landmarks which weren't typically standard spatial landmarks -- an easel or a large picture of a butterfly -- and saw how people navigated in those environments and how you could do that. And while it's very interesting, it's using a very different set of features and representations than we are, and you can imagine perhaps combining some of the different approaches together.

[inaudible] a few years ago thought about local human-robot interaction -- things like 'robot, go to the pillar' -- where they're thinking about interpreting directions, but not normally a sequence of directions, more a single instruction.

And then there's also been some work on autonomous wheelchairs and interpreting directions for those, by Gribble et al. and Müller et al. But most people aren't thinking about it as a global inference process.

So to conclude, we think that posing direction interpretation as global inference is promising, and we've got some encouraging initial results. And one thing I wanted to mention is that -- oops, apparently that advanced automatically -- in addition to this work I'm interested in a number of other areas, including reinforcement learning, [inaudible] and partially observable domains, and most recently my post-doc work is going to be about machine learning and optimization for emerging-regions technologies, so I'll be happy to talk to anybody about those things. And thanks for listening.

[applause].

You guys have any questions?

>>: Yeah. So at one point you were talking about reasoning over a bunch of different possible paths, entropy and things like that. It seems like another way to assess which might be the right path is to look at how reasonable it is. I know when people give directions, they give you the fast way or the easy way, right? So the fast way is not always the best.

But usually they don't have you going way off and taking a really weird route. That seems like that would be another way to --

>> Emma Brunskill: Some metric of that, like a metric of complication. Right now we're not incorporating any notion of distance, like how long it takes or traversability or anything like that, and it would be nice to have that.

>>: What's the [inaudible] reinforcement learning?

>> Emma Brunskill: Not for this. For separate things. This is just other aspects of my research.

So I was mentioning just before this, one thing that I was interested in previously, for larger-scale sort of mapping and navigation applications, is thinking about variance when you're giving directions.

So if someone said well do you want to go -- do you want to definitely be there in 10 minutes or probably be there in five minutes unless class just got out and it's really busy.

And so, yeah, thinking about variance in terms of the directions you give.

>>: This work reminds me of the GPS. So if you have a lot of information about maps and you follow what GPS is doing and just getting the destination you can find the shortest path.

>> Emma Brunskill: Right. GPSs definitely do that, too, but they're not normally incorporating this -- a GPS is giving you directions as opposed to receiving directions. So we're getting verbal descriptions of directions, but you can imagine feeding this into a GPS system. And if you wanted to give directions out, I think one thing that we've taken away from this is that if you're going to automatically give directions to someone, you should incorporate things like negative information and highly salient landmarks, because they'll be helpful to people when they're following those directions.

>>: So how much detail do you think the parsing should have in order to capture the full information of the verbal [inaudible]?

>> Emma Brunskill: It's hard to say. So my colleague Tom and another colleague of his,

Stephanie, who does a lot of stuff on speech, is trying to think about much more rich representations like spatial descriptive clauses and things like that which we're not incorporating here.

I think it would certainly help. I think there are sort of these rich notions of observations that depend on viewpoint. I think sometimes people say go slightly to the left or around things. So there's certainly a lot more geometry and types of spatial relation you can incorporate.

>>: To what extent do you have to change the model to cover that information?

>> Emma Brunskill: That's sort of an open area of research. That's not something I focused on, I think you'd probably want to push most of that into the transition probabilities and thinking of those observations as changing the transition probabilities.

>>: Changing the linkages.

>> Emma Brunskill: Changing the linkages, we're doing that right now for right and left basically saying the viewpoint of left and right is changing, what's the probability of going between regions regardless of any landmark observations.

>>: Do you have an automatic way of training those somehow?

>> Emma Brunskill: We don't right now. We just incorporate those by hand. I think one thing that would be interesting, too, that we were talking about briefly before, is that you could also learn patterns. So right now we're just assuming a uniform distribution, and that's not right. People definitely have preferences, particularly depending on your orientation -- if you're facing forward you're most likely to keep going forward -- and you can imagine learning those from collected data to have a richer transition model.

Thanks.

[applause]
