Eric Horvitz: It’s my honor today to present the host, Andrey Kolobov. This is actually a job interview talk. This is coming a little bit off cycle for the typical hiring season pattern. We thought we’d take a first crack at Andrey before he

[inaudible] other organizations. And Andrey was a star intern on our team two summers ago and we’ve known him for years through our association with the

University of Washington.

He’s been doing some really great state-of-the-art work in decision-making under uncertainty in situations where we think about sequences of actions and some notion of optimizing a long-term goal. Typically this is in a tractable space, as we’ll hear about, and lots of good work has been going on to make these problems more tractable.

Andrey did his undergraduate degrees in math and computer science at

Berkeley, came to Microsoft to do a two year stint as a [inaudible] manager, I guess, and now –

>> Andrey Kolobov: I was a developer.

>> Eric Horvitz: Oh, sorry about that.

[laughter]

As a developer in our top [inaudible] team and then went off to the University of

Washington to do his doctoral work. He’ll be finishing up probably, I’m guessing right now as one of his committee people, early winter.

So we’re getting an early look. Thanks, Andrey.

>> Andrey Kolobov: Thank you for the introduction, Eric. So as I said, I’m a fifth year, I’ve been in my fifth year of the [inaudible] program of the computer science department at the University of Washington. And my talk today is going to be about my main research topic, planning under uncertainty: about the kinds of real problems that we can solve with models of planning under uncertainty, about the challenges within this field, and about the work I’ve done to address these challenges.

So first of all, what is planning under uncertainty? Planning under uncertainty generally involves an agent, say a robot, that’s acting in the environment, and what happens here is this agent gets some sensing about the environment, some percepts about the environment, and based on these percepts it’s trying to select a sequence of actions.

So every time it gets a percept it’s asking itself what’s the next thing I should do?

Chooses an action, executes this action, the environment changes and essentially the process repeats. Okay?

Now, in planning under uncertainty we allow the environment to be dynamic.

Okay? We allow things that are essentially outside of the control of the agent to happen, and we allow the agent to reason about such things.

Moreover, we assume that the agent’s actions have stochastic effects. So the agent knows what it can do, it knows what can possibly happen if it does these things. It knows the probabilities, perhaps, of things happening when it does these things, but it doesn’t know exactly what will happen before it executes a particular action.

Okay. Moreover, we want to reason about a sequential decision-making process here. Okay? We are assuming that the agent is not myopic, is not just interested in the very next action, but is trying to come up with a sequence of actions, or policy, to execute, and it’s interested in optimizing the quality of this action sequence in some way.

Regarding sensing, we in this talk are going to assume that the agent’s sensing is perfect. So the agent’s sensors have no noise and give it perfect information about what it’s trying to sense. Now, this is not a universal assumption that’s made in planning under uncertainty. There are models there that can operate under the assumption that sensing is imperfect, right?

But the advantage of assuming that it is perfect is that it lets us model many real-life scenarios and makes models more tractable. We also assume that the environment is fully observable to the agent. So not only is the agent’s sensing perfect, but also the agent can get to know everything there is to know about the environment.

Again there are partially observable [inaudible], which also fall under planning under uncertainty that don’t make this assumption, and I’ll briefly talk about them as well, but the advantage of making this assumption is again it makes our models more tractable.

Okay, so what’s the main question in planning under uncertainty? Well, what we’re interested in is giving the agent the tools to optimize the sequence of actions that it’s choosing in this dynamic environment according to some measure. So if the agent is trying to get to the goal, the agent might want to try to optimize the amount of time it takes to get to the goal, the cost of getting to the goal, and so on.

All right? What problems can we solve under these assumptions? Here are a few, just a few. One, which is actually currently being worked on at NASA, is trying to land a damaged aircraft. The idea here is to help the pilot to land an aircraft that has been damaged or has sustained some system failures to land

safely in the presence of uncertainty about what the aircraft can do, about the weather at the nearby airports and around the aircraft, and so on.

For something completely different, crowdsourcing. By the way, people typically don’t assume full observability here; they model it with [inaudible]. So in crowdsourcing you typically have some task, and the tasks can be wildly different, from transcribing audio records to [inaudible] tasks. And you have a pool essentially of workers that are not particularly skilled at anything except for what humans are naturally skilled at.

And what you want to do is break up your large task into small tasks so that you can then recombine the results of the work of these human agents into a high quality result.

So one example, something that’s actually being worked on here at Microsoft is optimizing the classification tasks on the citizen science project called Galaxy

Zoo. Okay? Here you want these workers to classify galaxies for you based on the galaxies’ images, and you want your classification result to actually be reliable while hiring as few workers as you can.

Again, for something completely different, and this is a project, by the way, that I was personally involved in during my internship at MSR: scheduling operating system updates. The idea here is that there is some default way of scheduling updates: essentially, by default Windows downloads and installs them at something like 4 a.m., which is unsatisfactory in many senses.

For instance, if you have started some important job to run overnight, you don’t want the computer to just download updates and restart the machine in the middle of that important job. So the question here is how do we schedule OS updates in a timely manner, before their deadlines, but at the same time without interfering with the user.

Okay? Otherwise the user will look like this. Okay. Now, like any live research field, planning under uncertainty has challenges, but life would be very boring without them. The first one, as in many other areas, is the curse of dimensionality. As we will see, the standard methods for coming up with policies under uncertainty essentially try to reason about most or all states of the environment.

And obviously, you know, that doesn’t [inaudible]. So how do we as humans handle these problems? I mean, we can handle pretty large problems of planning under uncertainty. Well, we reason about the problem structure. We recognize some patterns about the world and we try to reason about the world in terms of these patterns.

And in the work that I’m going to show you I’m [inaudible] methods that essentially try to do that as well. Try to reason about the patterns of the problem.

The second issue is that even the models that we can solve moderately efficiently at this point, they’re still too restricted for many real problems.

All right? They don’t give us a theoretically principled way of reasoning about things like dead ends, for instance, or up till recently they haven’t let us do that.

So what we need is more general models, but also efficient algorithms for solving them, because more expressive models, if we can’t solve them, don’t do us any good.

And, again, that’s the challenge that I’m going to talk briefly about at the end of the talk. Okay, so here’s the outline for the rest of the presentation. First I’m going to give you a brief overview of Markov decision processes and then explain how we can deal with the curse of dimensionality by coming up with these methods that extract problem-specific structure in an automatic, problem-independent way and use the structure to solve planning problems.

As a result of reasoning about structure these methods are efficient, as I’m going to demonstrate. Next, I’m going to talk about extending the existing models and briefly talk about why we need to do this, why this is difficult, and the recent progress in this area.

And finally conclude talking about things that I’d like to do moving forward. Okay, so first a brief intro to Markov decision [inaudible], right? Assuming full observability, this is really the primary way of modeling planning scenarios, and to introduce it, and also the algorithms further down the talk, I’m going to be using a small example problem. Now, don’t be misled by the simplicity of this example, because the methods that I show you are actually capable of dealing with something much bigger and more serious.

So here is the example. Okay, we’re going to model a rational agent, a gremlin, that’s trying to sabotage an airplane. So gremlins are apparently these creatures whose sole purpose in life is to sabotage airplanes, all right? That’s what

Wikipedia says in any case.

>>: [inaudible]

>> Andrey Kolobov: So to achieve its objective the gremlin will need some tools first, and then I’ll explain how using these tools that it can pick up it can actually sabotage the airplane.

So, formally speaking, an MDP is this [inaudible] consisting of several components that try to describe various aspects of the problem you are trying to model, and the first of them is a set of state variables.

So the variables that are somehow relevant to describing the state of the environment. In our example we’re going to have five of them; we’re going to assume that they’re binary. We’re going to have a variable for whether the gremlin is alive or dead. The objective of the gremlin will be to sabotage the airplane.

Well, actually not dying.

We’re going to have a variable for whether the airplane is intact, and for whether the gremlin is holding each of three tools: the wrench, the screwdriver, and the hammer. Okay?

Now, the set of variables implicitly induces the state space. The state space, really, is just the set of vectors that are conjunctions of these variables’ values. Okay.

By the way, in the rest of the talk I’m going to refer to the assignment of a value to a variable as a literal, like a literal in logic, except a variable can have more values than just true or false.

>>: [inaudible]

>> Andrey Kolobov: So in this example yes, but in general they can be discrete-valued or even continuous. We’ll have an initial state. We’ll assume that we know the initial state, although we really technically don’t have to assume this. In this case the initial state is: the gremlin is alive, the airplane is intact, and the gremlin is not holding any of the tools. All right. And then we’ll have a set of goal states, states where the agent is ultimately trying to get to, which in this case is any state in which this conjunction of literals holds, okay?

The gremlin wants to be alive and wants the airplane to be sabotaged. Okay.

Now, on to the actions, on to what the agent can actually do. In this sort of description you give actions by specifying templates, each of which has two big essentially important components, and the first component is preconditions.

The precondition, again, is a conjunction of literals that tells you in which states the agent can use this action. So essentially the agent can use these actions wherever these conjunctions of literals hold. So the gremlin can pick up a wrench if it’s not holding the wrench already and is alive.

The second component is what can happen when you execute these actions, the actions’ effects. Again these are conjunctions of literals that tell you where you transition, where the agent transitions to if it executes the corresponding action.

In this case, in the case of these actions, the gremlin typically ends up with the tool in its paws, except when it tries to pick up the hammer; the hammer is large, the gremlin is small, and so it may fumble and drop the hammer. That’s one possible outcome of the action. The other possible outcome is that it succeeds in picking up the hammer, okay?

So this is the set of actions, the set A in the tuple. T is the transition function that tells you with what probability the corresponding effects can happen. For deterministic actions, essentially like picking up a screwdriver or a wrench, it assigns a probability of one to the corresponding transition, and in case you have several effects, well, it assigns a probability to every possible transition. Okay?

Just to --

>>: Did you really need the notion of preconditions or can you just think of it as a giant three dimensional matrix that tells you the probability of [inaudible] one state to another state when picking an action?

>> Andrey Kolobov: You absolutely can, but specifying in this form lets you convey the structure of the problem to the planner, okay? Which is very useful.

But yes, indeed in many formulations you don’t assume that you even have preconditions.
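To make the factored description above concrete, here is a minimal Python sketch of one way such action templates could be encoded, with preconditions as a conjunction of literals and a list of probabilistic outcomes. The Outcome/Action classes and the 0.8/0.2 probabilities for the hammer action are illustrative assumptions, not anything specified in the talk.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    effects: dict        # literals set by this outcome, e.g. {"has_hammer": True}
    probability: float

@dataclass
class Action:
    name: str
    preconditions: dict  # conjunction of literals that must hold to apply the action
    outcomes: list       # list of Outcome, probabilities summing to one

    def applicable(self, state):
        return all(state.get(v) == val for v, val in self.preconditions.items())

    def successors(self, state):
        """Yield (next_state, probability) pairs for this action in `state`."""
        for o in self.outcomes:
            nxt = dict(state)
            nxt.update(o.effects)
            yield nxt, o.probability

# Illustrative "pick up hammer" action: it may succeed or the gremlin may fumble.
# The 0.8/0.2 split is made up for the example; the talk never gives numbers.
pickup_hammer = Action(
    name="pickup-hammer",
    preconditions={"alive": True, "has_hammer": False},
    outcomes=[Outcome({"has_hammer": True}, 0.8),
              Outcome({}, 0.2)],  # fumble: nothing changes
)
```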

So again to complete the example once the gremlin picks up the tools, it can use them to actually sabotage the airplane. It can use the screwdriver and wrench to unscrew a few bolts and make the airplane dysfunctional, or it can strike the airplane with the hammer a few times and damage it this way, but the possible unfortunate consequence of this action is that the fuel vapors detonate, the airplane blows up, and the gremlin dies in the process.

The final component of the MDP specification is costs; the cost function essentially tells you how much it costs to execute an action in any given state. In this problem we’re going to assume that all costs are just one, okay, for simplicity.

Now, given this MDP description, what do we ultimately want? We want a specification of behavior for the agent to follow in any state of the world. All right? And preferably we want this behavior to optimize some objective like minimizing the expected cost of getting to the goal.

So the formal description of this behavior is a policy, which is a mapping; in this case we’re going to consider mappings from states to actions. Okay. Now, how do we solve this beast?

The standard methods follow the following general kind of approach. They compute the value function. And the [inaudible] value function is really something that tells you what’s the optimal expected cost of getting to the goal from any state. It turns out this is at least conceptually very easy to compute.

All right? And once we have that, as it turns out, it’s very easy to derive the optimal policy from it, pi star; again, the details don’t really matter, the important

point is that this is very easy once you have the optimal cost to go. You can kind of already see why these approaches fail to scale.

The number of states in your environment is exponential in the size of the MDP description, in the number of state variables, right? So you can’t really scale too far. And the thing is, if you want the optimal value function, this is really as well as you can do if your variables are discrete, right? Some value functions are not really representable exactly by anything other than a table.

So if you want the optimal value function you’re kind of stuck with these methods, and that’s exactly what methods like value iteration, policy iteration, RTDP and so on do. They try to compute this optimal cost vector and fail to scale.
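For reference, here is a minimal sketch of the tabular approach being criticized here: value iteration for a goal-oriented, cost-minimizing MDP. The table holds one value per explicitly enumerated state, which is exactly what makes these methods exponential in the number of state variables; the sketch also assumes every state can reach the goal, so the values converge.

```python
def value_iteration(states, actions, transition, cost, is_goal, eps=1e-4):
    """Tabular value iteration for a goal-oriented MDP.

    states     : explicitly enumerated list of states (the scalability problem)
    actions    : actions(s) -> iterable of actions applicable in s
    transition : transition(s, a) -> list of (next_state, probability)
    cost       : cost(s, a) -> float
    is_goal    : is_goal(s) -> bool
    Returns V (expected cost-to-goal) and a greedy policy.
    """
    V = {s: 0.0 for s in states}

    def q(s, a):
        return cost(s, a) + sum(p * V[sn] for sn, p in transition(s, a))

    while True:
        delta = 0.0
        for s in states:
            if is_goal(s):
                continue
            best = min(q(s, a) for a in actions(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    policy = {s: min(actions(s), key=lambda a: q(s, a))
              for s in states if not is_goal(s)}
    return V, policy
```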

So ultimately the only thing you can do if you want to solve big problems is really to approximate. You need to sacrifice some aspect of the solution quality and get an approximate solution.

And there have been two major paradigms for how to do that in solving MDPs.

The first one is called determinization and, as its name implies, it essentially consists in relaxing your probabilistic [inaudible] problem by throwing away all the uncertainty about actions’ effects. I’ll show you one way to do this.

So what this achieves is, well, it turns the probabilistic problem into a deterministic one. Now, why would you want to do this? As it turns out, classical planners, deterministic planners, the ones that solve these kinds of problems, they’re really, really fast, okay? They are much more scalable than the state of the art in probabilistic planning.

So as a result what you can do is the following. If you don’t care about computing the policy for the entire state space, if you just want to plan as you go, plan online, what you can say is, so I’m in this state, right? I want to select an action. I’m going to [inaudible] my problem and I’m going to pass it to the classical planner.

I’m going to ask it: give me a few plans from the state where I am to the goal in this determinized version of the problem, right? It does this very quickly, and then based on the action statistics over the plans that it gives you, you somehow select the action that you want to perform in the current state.
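A rough sketch of that determinize-and-replan loop, in the spirit of FF-Replan as described here. The `classical_plan` and `execute` callables are placeholders for an off-the-shelf deterministic planner and the real stochastic environment; the majority-vote action selection is just one plausible way of using the "action statistics over the plans".

```python
from collections import Counter

def replan_agent(start_state, is_goal, det_actions, classical_plan, execute,
                 num_plans=5, max_steps=200):
    """Online determinize-and-replan control loop (FF-Replan-style sketch).

    classical_plan(state, is_goal, det_actions) is assumed to return a list of
    deterministic actions reaching the goal, or None if it finds no plan;
    execute(state, action) samples the true stochastic outcome.
    """
    state = start_state
    for _ in range(max_steps):
        if is_goal(state):
            return state
        # Ask the classical planner for a few plans in the determinized problem
        # and take the action that most often starts them.
        first_actions = Counter()
        for _ in range(num_plans):
            plan = classical_plan(state, is_goal, det_actions)
            if plan:
                first_actions[plan[0]] += 1
        if not first_actions:
            raise RuntimeError("classical planner found no plan: possible dead end")
        action = first_actions.most_common(1)[0][0]
        state = execute(state, action)  # the real, stochastic environment
    return state
```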

>>: So it’s a [inaudible]

>> Andrey Kolobov: It’s a -- I mean you can view it as greedy. It’s just very, very crude. Okay? That’s the way to view it. Yeah?

>>: [inaudible]

>> Andrey Kolobov: Yes, in this second we do. Yes. Although, again later on I’ll talk about what to do when they are not or you can’t really tractably use them.

>>: But determinization from any state on any action gets to just one other state, right?

>> Andrey Kolobov: Exactly. Exactly. So if an action has a catastrophic effect you’re not taking this into account when you’re doing this, and that’s exactly the weakness of this approach. And if you’re doing this blindly you’re going to have problems dealing with actions that have catastrophic consequences or simply as a side effect set you on a very expensive path to the goal. Yeah?

>>: Have people tried to do multiple samplings to do different determinizations where you said instead of --

>> Andrey Kolobov: Exactly. People have tried improving on this, okay? The right way to view it is that this is kind of just the general idea; people have tried to improve it, but even in those approaches you lose something. Like, if you do multiple samplings, for instance, well, the method becomes much more expensive because you have to solve so many more problems for each state you’re in.

Okay? Now, the other kind of general classic approach that people have been using is function approximation. Really it’s the kind of dimensionality reduction that is probably familiar to a lot of you from machine learning. The idea here is that these methods that I’ve described, they really solve a problem with a number of

[inaudible] that is equal to the size of the state space, right?

They learn the value function, right? It has a number of parameters equal to the size of the state space. So the idea here is to pick a few basis functions, okay, just only a few. And these basis functions can be viewed as giving you some information about the possible optimal value of a state. Okay?

And then you approximate the optimal value function somehow with a combination of the basis function, for instance by taking a linear combination of them. So the only parameters that you need to learn here are these weights, okay?

And if there are only a few basis functions, there are only a few weights. Now, the problem with this approach, especially in domains with discrete variables where you can’t put any kind of natural metric on the state space, is that to get meaningful results you need to have a human engineer look at the problem, figure out what the meaningful sort of meta-features of the problem are, okay,

and then encode them in the set of basis functions.
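A minimal sketch of the linear value-function approximation described above, with a couple of hand-written basis functions for the gremlin example standing in for exactly the kind of human-engineered meta-features the speaker wants to avoid.

```python
def approx_value(state, basis_functions, weights):
    """V(s) ~= sum_i w_i * b_i(s): only the weights need to be learned."""
    return sum(w * b(state) for b, w in zip(basis_functions, weights))

# Hypothetical hand-engineered basis functions for the gremlin example -- exactly
# the human-in-the-loop feature design the talk wants to automate away.
basis_functions = [
    lambda s: 1.0 if s.get("alive") else 0.0,
    lambda s: 1.0 if s.get("has_hammer") or s.get("has_wrench") else 0.0,
]
weights = [0.0, 0.0]  # to be learned, e.g. by temporal-difference updates

example_state = {"alive": True, "has_hammer": False, "has_wrench": True}
print(approx_value(example_state, basis_functions, weights))
```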

So here I have these two kinds of paradigms with their own serious disadvantages, one is not so good at dealing with probabilities with uncertainty, the other one involves a human in the loop.

And what the work that I’ve done achieves is it marries these two approaches to get rid of the cons of both of them, essentially, and to come up with something that can reason about problem structure encoded in basis functions by extracting those basis functions in a problem-independent way, and because of that scale to much larger problems.

[inaudible]. All right? And what I’m going to show you next is how this is actually done. All right? Before we get into details I’d like you to think about how we as humans reason about planning problems, right? So probably most of you buy the fact that we reason about properties of states, but how? Okay. So suppose you need to go grocery shopping, right, and suppose that your car just doesn’t have enough gas to get to Costco.

Okay? Now, in a sense the state of the world isn’t so good, right? Because getting to the shop is kind of expensive. First, you need to get to the gas station, then you need to pay for the fuel and so on, right? So what you want to do is you want to go to a cheaper state, a state that’s better where your car does have gas, okay?

So the way we reason about states of the environment is, if we’re in a bad state, if the property of the state tells us that the state is bad, we want to go to a better one. Okay? And that’s essentially what these methods are attempting to achieve. In particular, the question is what are the good properties of the states, what are things that tell us good things about states and what are things that tell us bad things about states?

So in this work we’ve considered these properties that take the form of conjunctions of literals like the one in pink here where the gremlin is alive, the airplane is intact, and the gremlin is holding the hammer. And as it turns out when [inaudible] very particular meaning, okay?

A property is a conjunction of literals such that if such a conjunction of literals holds in a state, here are three states where that conjunction of literals holds, okay, we want the guarantee that we can reach the goal via some sequence of actions from any of these states, okay?

So intuitively these properties, the conjunctions of literals describing these properties, are good in the sense that they tell us that you can reach the goal, there is a possibility that you will reach the goal from any of these states. All right? And since we ultimately want to turn these properties into basis functions, with each of them we’re going to associate a value, essentially:

the basis function is going to have value one in any state where such a property holds,

okay, and a value zero otherwise. So for each such property we’ll have a basis function.

>>: [inaudible]

>> Andrey Kolobov: Yeah. Not for certain, but with a positive probability. Yes.

Yeah?

>>: The cost that you have to pay to get to the goal may exceed the benefit when you reach your goal in some settings, because it’s just taking so long for me to get to my goal state that that path is actually undesirable.

>> Andrey Kolobov: Exactly.

>>: But here I’m labeling it as a good thing.

>> Andrey Kolobov: Well, good only in an intuitive sense. In a moment we’ll explain how we can actually distinguish between these good properties by assigning weights to them that are going to tell us that this property is really good, in the sense that we can reach the goal in a very cheap way, and this other property is good but not so good, in the sense that reaching the goal is expensive. Okay?

But for now this is the general form of the properties that we want to use.

Of course we also want to know what makes a state bad. Right? We want to know from which states reaching the goal is impossible, right? And for that we’ll have a separate kind of properties; we’ll call them no good properties, also conjunctions of literals. In this case we’ll have a [inaudible] conjunction: when the gremlin is dead.

And again, for these we want to guarantee that if a state has such a property, here are three states that have this property, we cannot reach the goal with any positive probability.

All right? In this case it’s sort of easy to see that no good properties are the ultimate causes of what makes a state bad. All right? Here when the gremlin is dead obviously you can’t do anything, okay? And again we’re going to associate with each of them a basis function that’s going to have a value of infinity if the property is present in the state and zero otherwise.

Okay. Now, that’s fine. We know what properties we want. How do we actually get them in an efficient way? Remember that was sort of the weakness of the

dimensionality reduction approaches in that you needed a human to come up with the basis functions.

Okay? It turns out we can come up with both of these kinds of basis functions that I’ve described automatically. And here’s how we come up with the good properties. Okay? That’s exactly where the classical planning paradigm comes in.

What we do is we turn the problem into a deterministic version. All right?

Multiple ways of doing this, we’re going to follow the most straightforward one, which is you take each probabilistic action and you split it into one deterministic action per effect, okay?

So in the case of this action we’re going to have one action, which helps the gremlin achieve the goal for certain, and one in which the gremlin just dies for certain.

As you may imagine this action is pretty darn useless.
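The all-outcomes split just described, as a short sketch reusing the Action and Outcome classes from the earlier snippet: each probabilistic action becomes one deterministic action per outcome, and the probabilities are simply dropped.

```python
def all_outcomes_determinize(actions):
    """Split each probabilistic Action into one deterministic action per outcome,
    dropping the probabilities (the weakness discussed above)."""
    det = []
    for a in actions:
        for i, o in enumerate(a.outcomes):
            det.append(Action(name=f"{a.name}#outcome{i}",
                              preconditions=dict(a.preconditions),
                              outcomes=[Outcome(dict(o.effects), 1.0)]))
    return det
```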

Now, how do we actually extract the properties from this classical planning problem? What we do is we pass this problem to a deterministic planner. Okay?

The same way as planners like [inaudible] do under the determinization paradigm. And we’re going to ask the deterministic planner to find for us a plan to the goal from the state that we’re interested in evaluating. All right?

So if we pass it the initial state, it’s going to come up with something like this, just a linear plan to the goal. The gremlin picks up the hammer and strikes the airplane, all right?

And what we’re going to do afterwards, that’s the key step in the process, is we’re going to regress this plan. If you know what regression is in machine learning, that’s not it. What this regression does is it essentially takes the conjunction which characterizes the goal, and it tries to figure out what you needed, which literals you needed at the previous step, at the last step in the plan, to achieve the goal from that step.

So what did you need to execute this action in that state? Roughly, what it does is it takes the goal conjunction, removes from it the effects of the last action, and adds the preconditions of the last action, all right?

So if we do this process to the last action in this plan, what you arrive at is, well, just the precondition of this action. That should be sort of pretty obvious.

At the previous step the gremlin must be alive, the airplane must be intact, and the gremlin must be holding the hammer. Okay?

You continue the regression. You roll back the next action by treating this conjunction as a subgoal in a sense, and where you arrive is, well, at the previous state, to execute this action in brown, the gremlin must have been alive,

the airplane must have been intact, and the gremlin must not have been holding the hammer already.

All right? Now, notice these conjunctions of literals do have the guarantee that we want. If you have such a conjunction of literals in a state you know that you can reach the goal via the suffix that you regressed to obtain this conjunction of literals, right?

In other words, these conjunctions of literals are exactly the properties that we want, and extracting them in this way lets us get the good basis functions. Right?

Moreover, remember we will ultimately want to learn weights for these basis functions and just as initialization we’re going to give each basis function the weight that corresponds to cost of the plan that we used to regress [inaudible].

In this case all actions have a cost of one, so this basis function gets a weight of one and this one here gets a weight of two. So intuitively, the lower the weight, the better the basis function is supposed to look.
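A sketch of the regression step and the weight initialization just walked through: roll each action back from the goal by removing its effects and adding its preconditions, and give each resulting conjunction an initial weight equal to the cost of the plan suffix it was regressed from. The literal encoding and unit action costs are illustrative assumptions.

```python
def regress(conjunction, preconditions, effects):
    """Roll one deterministic action back through a conjunction of literals:
    drop the literals the action achieves, then add its preconditions."""
    result = {v: val for v, val in conjunction.items() if effects.get(v) != val}
    result.update(preconditions)
    return result

def basis_functions_from_plan(goal_conjunction, plan, action_cost=1.0):
    """Regress a deterministic plan (a list of (preconditions, effects) pairs,
    last action last), yielding (conjunction, initial_weight) pairs; the weight
    of each basis function is the cost of the plan suffix used to obtain it."""
    basis = []
    conj = dict(goal_conjunction)
    suffix_cost = 0.0
    for preconditions, effects in reversed(plan):
        conj = regress(conj, preconditions, effects)
        suffix_cost += action_cost
        basis.append((dict(conj), suffix_cost))
    return basis

# Gremlin example: goal = gremlin alive and airplane sabotaged; plan = pick up
# the hammer, then strike the airplane (deterministic "success" outcomes only).
goal = {"alive": True, "plane_intact": False}
plan = [({"alive": True, "has_hammer": False}, {"has_hammer": True}),
        ({"alive": True, "plane_intact": True, "has_hammer": True},
         {"plane_intact": False})]
print(basis_functions_from_plan(goal, plan))
```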

>>: [inaudible] in the previous slide or two what you did was determinized the problem by taking each action and creating two actions, one where it had effect A and the other where it had effect B, but not paying attention to the probabilities that those two --

>> Andrey Kolobov: Yeah.

>>: So could that cause problems in the second step if you end up basing a plan or deciding on good states because something looked good in this deterministic version but actually it’s extremely improbable?

>> Andrey Kolobov: So again, for now we just extracted these good states.

Now, we’re actually going to learn the weights of these states and that’s where this probabilistic information that we’ve discarded --

>>: Or put differently, could you miss some good states? Like, you might not find all the good states here, right?

>> Andrey Kolobov: No, typically if a classical planner gives you, does manage to find a plan, which is pretty much always if a plan exists --

>>: But there may be many, many plans.

>> Andrey Kolobov: Sure. But it will give you some.

>>: But what if it doesn’t give you ones that correspond to actually likely outcomes because you ignored the probability?

>> Andrey Kolobov: So, yeah. So one answer to this question is you’re going to miss out on something. But the other answer is there are several ways of doing determinization, and one way in particular is essentially when you do the conversion, when you split the actions, you modify the costs of these outcomes by dividing them by the probabilities, for instance.

So essentially outcomes that are expensive or have low probability, or both, they’re going to look like really expensive actions in the deterministic version of the problem here. Yeah.
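A sketch of that cost-adjusted determinization: the same all-outcomes split as before (reusing the earlier Action and Outcome classes), but each deterministic copy carries a cost that reflects its outcome’s probability. Dividing the cost by the probability is the option mentioned in the answer; an additive -log(probability) penalty is another common variant, included here only as an alternative.

```python
import math

def cost_reweighted_determinize(actions, base_cost=1.0, use_log=False):
    """All-outcomes determinization with probability-aware outcome costs.
    Returns (deterministic_action, cost) pairs."""
    det = []
    for a in actions:
        for i, o in enumerate(a.outcomes):
            if use_log:
                c = base_cost - math.log(o.probability)  # additive -log(p) penalty
            else:
                c = base_cost / o.probability            # as mentioned in the answer
            det.append((Action(name=f"{a.name}#outcome{i}",
                               preconditions=dict(a.preconditions),
                               outcomes=[Outcome(dict(o.effects), 1.0)]),
                        c))
    return det
```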

Again, that’s a possibility and you can easily plug this in to this. Yeah, please.

>>: [inaudible]

>> Andrey Kolobov: Yeah. So in fact, if you notice the -- in principle the number of basis functions could be three to the power x, where x is the number of variables. In fact, in theory there could be many more basis functions than there are states.

But the number of meaningful ones is typically very small, that’s one thing, and the other one is that as you will see we’ll get basis functions on demand.

So we’ll only get them as necessary. And the nice thing about each of them is that once you get a basis function you typically can evaluate a whole bunch of states. So you don’t really need more basis functions for a whole bunch of states and that’s essentially what’s going to keep the number of basis functions under control.

Yeah?

>>: You’re showing the determinization. So you’re showing it for one action. Like, do you determinize the entire chain? Is that sort of exponential in the length of the chain when you have two connections, or what does it --

>> Andrey Kolobov: Well, no, because our actions are not chained. So given the description of an action with the preconditions and the effects, for each such template, so to say, you split it in two.

>>: So you look at each of them separately.

>> Andrey Kolobov: Each action is separate, exactly. So you don’t get exponential blowup. This process is essentially linear in the number of actions or action outcomes.

>>: I’m just sort of curious as to where this is going. So the original algorithm is guaranteed to find the optimal answer in exponential time. When you’re done

with this will there be another time trade off that, a time quality trade off that you can guarantee?

>> Andrey Kolobov: So in terms of guarantees, no. [inaudible]. It works well in practice. Well, empirically we’ve done a lot of tests compared to other methods.

It produces better results than other approximate methods, it works faster than them, it uses less memory, but other than empirical guarantees you can’t really say much.

>>: And does anything have any guarantees?

>> Andrey Kolobov: Typically not. When you are doing approximations, if you want to be efficient, essentially you should be prepared to lose out on guarantees. There are some dimensionality reduction methods that say that if the method halts, then you have some kind of guarantees.

Yeah?

>>: There is some notion of good [inaudible] because you could imagine a problem in which every action can have every possible outcome with a very local

[inaudible]. So if you have this, everything’s good, except that they’re going to have high cost if you move this [inaudible]. So you are not only looking for good combinations of literals but good with respect to cost.

>> Andrey Kolobov: Yes, yes.

>>: It’s not just good and bad, it’s [inaudible].

>> Andrey Kolobov: Exactly. So when we are learning weights we are going to get part of this --

>>: You could have a situation where all the states are good because there is a path that has very local [inaudible].

>> Andrey Kolobov: Yes. So in goal-oriented settings you typically don’t really have that. Okay? But we are going to look later on at reward-based settings where that is often the case. Okay? No state looks particularly bad or good, okay, in terms of the long-term reward that you are getting.

And again, those problems are kind of a [inaudible] for this class of approaches, and for them you need something different, okay?

And I’m going to briefly talk about that later.

All right. So let’s move on. Right. So now about the meaning of the basis function weights. As many of you have already pointed out, you know, states

aren’t just good or bad; if they’re bad they’re kind of, at least in the sense that we described, really bad. Okay.

You just can’t reach the goal. But if they’re good you may be able to reach the goal via essentially different costs depending on what the basis function tells you.

So here is one example. Here is the state that has one basis function, one property in blue, which essentially guarantees that you may be able to, in this case that you definitely can, reach the goal via this two-step plan, right?

And then there is this basis function in pink that essentially was obtained by regressing the one-step plan where the gremlin just hits the airplane with the hammer, right?

But indeed if you try to execute this plan in the probabilistic world, there is a danger that the agent will actually die, right? So [inaudible] speaking what we want to do is learn basis functions’ weights in such a way that basis functions, like the one in pink, look worse than the ones in blue. All right?

So the details of how to do it are sort of slightly messy. So I’m not going to concentrate on them, but I’m going to show you how the process of extracting the basis functions and learning their weights can be put together to produce a planner. Okay?

And the planner is called ReTrASE, which stands for regressing trajectories for approximate state evaluation. So here is how it works. You are given a problem, a probabilistic planning problem modeled as an MDP.

The first thing you do is you determinize it, all right? You are going to just cache the determinized version of the problem and never do the determinization process again. And you’re going to cache it in this thing called the extraction module that is going to handle basis function extraction and weight learning for us.

And then what you do is you run a state space exploration routine. Intuitively speaking, what it does is it analyzes various states in the state space in some order, again the order doesn’t matter, looks at their values, sees which states are good in the sense that their values are high, and tries to determine the policy, the best action to do in these states.

All right? But of course it needs to have values of these states, and at the beginning, when the process just starts, remember, our tool for evaluating states is basis functions, and we have no basis functions yet, right?

So the state space exploration routine runs into a state, realizes it has no basis functions for it, and asks the extraction module, can you evaluate this for me? All right? The extraction module runs a classical planner from that state in the

determinized version of the problem, as I just described, regresses it, regresses the plan that it gets, okay? It extracts basis functions via regression.

And remember, if you start a plan in some state and then you regress it you definitely get at least one basis function that is one property of the state where you started, right? So you have some information to evaluate the state. So the extraction module evaluates the state and passes the value back to the state space exploration routine.

Now, the state space exploration routine does some more analysis and tells the extraction routine, hey what you need to do is you need to update the weights of the basis functions that you use to evaluate the state in a certain way.

So to correspond with our intuition that bad basis functions should get high weights and the good ones should get low weights. Okay? And essentially that’s how the process operates, right?

The one thing to notice is at the beginning when state space exploration just starts, for the first few states you’ll need to run, you’ll essentially need to extract new basis functions to evaluate these states.

But as the number of basis functions that the extraction module already knows about grows, it’s increasingly able to evaluate states that it is passed via the basis functions that it already has. Okay?

And again, that’s what controls the number of basis functions. You get them sort of on demand, and demand for new basis functions is shrinking as you go along because, intuitively speaking, you know more and more properties of states.
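Since the talk skips the weight-learning details, the following is only a coarse sketch of the interaction just described between the state-space exploration routine and the extraction module. It reuses the determinization and regression helpers sketched earlier, and the mdp attributes, the min aggregation over weights, and the omitted update rule are all placeholders rather than ReTrASE’s actual machinery.

```python
def retrase_sketch(mdp, classical_plan, explore_order, n_iterations=1000):
    """Coarse sketch of the ReTrASE control flow described in the talk.

    mdp is assumed to expose .actions, .goal (a test) and .goal_conjunction;
    classical_plan(state, goal, det_actions) returns a list of
    (preconditions, effects) pairs or None; explore_order yields states in
    whatever order the exploration routine prefers.
    """
    det_actions = all_outcomes_determinize(mdp.actions)  # determinize once, cache
    basis = []  # (conjunction, weight) pairs, extracted on demand

    def holds(conj, state):
        return all(state.get(v) == val for v, val in conj.items())

    def evaluate(state):
        weights = [w for conj, w in basis if holds(conj, state)]
        if not weights:
            # No known basis function applies: call the classical planner on the
            # determinized problem and regress the plan it returns.
            plan = classical_plan(state, mdp.goal, det_actions)
            if plan is None:
                return float("inf")  # looks like a dead end
            basis.extend(basis_functions_from_plan(mdp.goal_conjunction, plan))
            weights = [w for conj, w in basis if holds(conj, state)]
        return min(weights) if weights else float("inf")  # placeholder aggregation

    for state in explore_order(mdp, n_iterations):
        value = evaluate(state)
        # Here the exploration routine would use `value` to pick actions and tell
        # the extraction module how to adjust the weights of the basis functions
        # that fired for `state`; the actual update rule is not given in the talk.
    return basis
```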

Now, what I’ve swept under the rug so far is that you may run into states that are dead ends, from which there is no path to the goal. So the classical planner is simply just going to fail on you. All right? And again, when the state space exploration process starts, you’ll have to run the classical planner in a few such states and memorize a few dead ends.

But at some point remember we have these other kind of properties called no good properties. And no good properties essentially help us identify dead ends.

So at some point the extraction module invokes this machine learning procedure called sixth sense, which helps the planner sense dead ends, right? And this machine learning procedure essentially uses two kinds of training data.

You can view it as positive examples and negative examples. Positive examples are dead ends, those are the states that you want to classify as bad, and basis functions are training examples that summarize the good states, those that you want to classify as not bad, using the no goods.

All right? Again, once you invoke this procedure a few times you’ll have a few no goods. And with the help of these no goods you’ll be able to evaluate states without running the classical planner from them, but in a very cheap way, using the no goods.

And by running this process after some time and terminating it you get a policy.

Now, the last piece of the puzzle that I haven’t shown you yet is how to actually get the no goods. How does sixth sense operate?

And that’s what’s coming up next. So first of all, sixth sense is a two-stage algorithm. Okay? And the first stage is going to generate conjunctions of literals that could be no goods, like no good candidates. Then it’s going to test each of them.

And what’s going to drive generation of no good candidates is the following observation. So notice that basis functions and no goods, sorry, the good properties of states and bad properties of states, give us exactly opposite guarantees about the states in which they hold.

No good says you can’t reach the goal under any circumstances, and the good property says you can reach the goal somehow. Okay? Since both of them are conjunctions of literals, a no good and the basis function can’t coexist in one state, right?

What this means is that these conjunctions of literals somehow must clash. They must contradict each other in some way. So essentially a no good can be viewed as a conjunction of literals that, for every basis function in existence, contains a negation of some literal in that basis function.

Right? Now, we don’t know all the basis functions, right? We don’t know all the basis functions in existence and we don’t even want to know them because, again, there are many of them, as you pointed out.

So what we are going to do is given the basis functions that we do know, we’re going to essentially compile this set of literals that contradicts every basis function that we know. In particular, there is really only one minimal no good in our problem. That’s when the gremlin is dead. Notice that it contains, well, it consists of just one literal, and this literal is a negation of some literal in every basis function that we know.

These are the two basis functions that we know, it contradicts this basis function and this basis function because both of them say the gremlin must be alive, all right? So what we’re going to essentially do is we’re going to go through the basis functions that we have and sample contradicting literals for each of them.

And that’s going to form our no good candidate. Now, of course, since we don’t know all the basis functions, we don’t know if the set of literals that we have compiled actually defeats every single one of them, every single one in existence. Yeah.

So we need to test whether this no good candidate is indeed a no good. Again, we are going to do this with a procedure that is polynomial, polynomial in the number of MDP variables, not states but variables, and sound.

Okay? I’m not going to go into detail, I’ll just say a few words about it, but first let’s look at how we generate the candidate, okay? Remember, besides the basis functions that are partly [inaudible] creating the no good, we also have dead ends that we’ve seen.

And dead ends, intuitively, if we look at the literal occurrence statistics in them, help us identify the literals that are, intuitively speaking, likely to be part of a no good.

If you look at the literal occurrence statistics, we see for instance that this literal that actually forms the only no good that we have in the problem is quite common among them. It occurs in every single dead end. Of course the literal occurrence statistics may be correlated, and there are other literals that are also popular in the dead ends, right?

So using these statistics we can make the process of sampling a candidate more efficient by [inaudible] basis function that we have, picking a set of literals that defeat it, okay, we’ve already been through that part, looking at their occurrence statistics in the dead ends, and sampling a literal from the distribution [inaudible] by the literal occurrences in dead ends.

Okay? So if we do this in this case we roll the dice and sample from this distribution in red and say we select literal that says that the gremlin is holding the hammer, okay?

Now, we go on to the next basis function that we know about, say this one.

Okay? And we do the same thing. We look at the literals that contradict this basis function, we look at the occurrence statistics in dead ends, we sample from that distribution and say we select this literal that says that the gremlin is dead.

Now, if the basis functions that I just showed you are all we know at the time we run this algorithm, then we stop here, okay? We just finished creating our candidate.
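A sketch of that candidate-generation step: for each known basis function, sample one contradicting literal, biased toward literals that occur often in the dead ends seen so far. Literals are (variable, value) pairs over binary variables here, and the +1 smoothing is an assumption to keep the sampling well defined.

```python
import random
from collections import Counter

def negate(literal):
    """Literals are (variable, value) pairs over binary variables in this sketch."""
    var, val = literal
    return (var, not val)

def sample_nogood_candidate(basis_functions, dead_ends, rng=None):
    """Build a candidate no-good: one contradicting literal per known basis function.

    basis_functions : list of conjunctions, each a list of (variable, value) literals
    dead_ends       : list of states (dicts) known to be dead ends
    The candidate must still be verified by the planning-graph-based test.
    """
    rng = rng or random.Random(0)
    # How often does each literal occur in the dead ends observed so far?
    occurrences = Counter((var, val) for s in dead_ends for var, val in s.items())

    candidate = set()
    for conj in basis_functions:
        contradicting = [negate(lit) for lit in conj]
        if any(lit in candidate for lit in contradicting):
            continue  # this basis function is already contradicted by the candidate
        weights = [occurrences[lit] + 1 for lit in contradicting]  # +1 smoothing
        candidate.add(rng.choices(contradicting, weights=weights)[0])
    return candidate
```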

All right? And at this point we test it. Okay?

Now, there’s this procedure called the planning graph, which some of you may know, but I don’t want to go into the details of it. The important part is that you can actually use it, you can modify it significantly, to perform the following test. You take a no good and you see whether you must fail to reach the goal from every single state where this no good holds. Okay?

Except, if you do it just the way I described, this is going to be very expensive, right, because there are many states in which the no good applies. There are some tricks by which you can essentially lump all these states in which the no good holds into one, and do a test using the planning graph from that aggregate mega-state, you could say.

And the resulting procedure that you get is polynomial in the number of domain variables, so it’s very efficient. It’s also sound, so if it tells you that you can’t reach the goal from any of the states where the no good holds, you should believe it, okay. Now, what’s the catch? It’s incomplete.

Okay? So there may be conjunctions of literals that [inaudible] no goods, so conjunctions of literals that have the property that you can’t reach the goal from any of the states in which they hold, but this procedure won’t tell you that.

Okay?

But in practice, again, it’s quite effective. Now, for the last step in the process, notice that the no good candidate, by the way, for this no good candidate the procedure is going to say that yes, it is indeed a no good, but this no good is not minimal, in the sense that it contains information that is superfluous, right?

We don’t really care if the gremlin is dead and clenching the hammer in its dead paws, right?

It’s irrelevant. What really matters is that the gremlin is dead. And to reduce this no good candidate further we essentially remove literals from it one by one and redo the testing. Okay?
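A minimal sketch of that reduction loop, assuming a hypothetical is_nogood callable standing in for the sound-but-incomplete planning-graph test described above.

```python
def minimize_nogood(candidate, is_nogood):
    """Greedily drop literals that are not needed for the no-good to keep passing
    the test; what remains is minimal in the sense that no single literal can be
    removed without losing the guarantee."""
    nogood = set(candidate)
    for literal in list(nogood):
        trial = nogood - {literal}
        if trial and is_nogood(trial):
            nogood = trial
    return nogood
```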

And in this case we get a no good that’s minimal in some sense, in the sense that you can’t really reduce it any further. Okay. And that essentially completes the description of ReTrASE and its component sixth sense, and the result is a planning algorithm that essentially outperforms the state of the art in goal-oriented problem solving.

What does that mean? On [inaudible] benchmarks, the probabilistic planning community has a whole bunch of benchmarks like the machine learning community, and across them our method essentially beat the previous state of the art. Here’s one example, an example on fifteen problems from a single set.

Here problems roughly increase in difficulty and size from the first to the fifteenth, and in particular these last three problems have state spaces of ten to the hundredth states.

Okay. So the number of particles in the universe doesn’t even begin to describe this number. It’s that huge. Now, of course these problems are structured, right?

And the strength of our approach is in being able to extract this structure and reason about it in an effective way to come up with good solutions.

So in particular, one of the state of the art algorithms that we compared it to is called FFReplan, which I mentioned before. As I already said, these methods do have problems with dead-end states, okay?

If you determinize the problem, as FFReplan does, and just completely throw away the probabilities, well, you’re going to run into bad situations from time to time. However, the advantage is that this kind of approach is very efficient because really it’s just a classical planner running all the time and classical planners are really efficient.

Yeah?

>>: Does FFReplan reweight the paths -- oh, sorry. It’s not doing the split the way that you’re doing so it doesn’t have the opportunity to redo that cost to you.

>> Andrey Kolobov: I mean, you could slightly improve its properties by essentially doing this cost reweighting that I described and -- Yeah?

>>: I’m curious if, like, have you tried just taking the first part of what you describe, where you split every action, did the cost reweighting, and then just ran a deterministic planner on that to see how it would compare with your

[inaudible]?

>> Andrey Kolobov: Yes, so if we can solve the problem this way it works better; the caveat here is that up till recently deterministic planners essentially could only efficiently handle uniform action costs.

>>: I see.

>> Andrey Kolobov: Okay? So if the action costs were different they wouldn’t be so fast, and you kind of lose one of the big advantages. The situation has improved since then, but yeah. In any case, this approach can actually scale to problems of this size as well. FFReplan can scale.

However, as you see, in problems where an optimal policy exists, and in this case, by the way, we measure the quality of the output in terms of how likely the policy is to reach the goal, okay, and this is done just by simulation.

So in these cases where there is an optimal policy that reaches the goal with 100 percent probability, but there are dead ends, FFReplan does quite poorly, you see? It runs into problems due to discarding uncertainty.

But there is another planner called FBG that does a more principled decision-theoretic analysis, and as you see, generally on problems where you can reach the goal with high probability it comes up with optimal policies in that [inaudible] determinization criterion.

But the consequence of doing this kind of reasoning that it does is that on large problems it just fails to scale. It just fails to solve them. It fails there completely.

Okay?

And ReTrASE essentially beats both of them by a significant [inaudible] in terms of policy quality across a range of problems. So the other kind of experiment that we did is, well, how well does it scale compared to optimal or near-optimal planning techniques?

A very famous, at least in the probabilistic planning community, optimal and relatively efficient algorithm is LRTDP. It’s a heuristic search algorithm. So if you run it with an admissible heuristic it’s going to give you an optimal solution. If you run it with an inadmissible heuristic it’s going to be more efficient, but it’s going to give you a suboptimal solution, though typically not too bad.

So here we plotted the amount of memory that these techniques use on the Y-axis on a log scale, and notice that LRTDP, in both cases, whichever heuristic it uses, is linear on this log scale. So really it scales exponentially with the size of the problem. Okay?

And ReTrASE, essentially, its graph is much closer to logarithmic, so it scales polynomially, linearly in the case of these particular problems, not always, of course, in the size of the problem. Okay.

So just a brief mention of the fact that you can actually use these techniques, the reasoning about basis functions that I described, to help existing techniques.

Okay? You can use the initial values of the basis functions; you remember, when we regressed the plans to get the basis functions, we got some weights for them.

You can use these weights as heuristic values essentially to help heuristic search planners. And what you do is well, you simply run the heuristic search planner like LRTDP and you feed to it the value estimates based on the initial weights of the basis functions. Okay?

The result is this heuristic that we call GOTH, the generalization of trajectories heuristic. And empirically, again, we ran LRTDP with this heuristic and with the previous heuristic known to be the most informative on this whole range of problems, and what we discovered is that GOTH empirically typically leads the planner to explore many fewer states and at the same time have comparable or

better policy quality than if the planner uses the other, pretty informative heuristic called HFF.

Okay. So here you can see that GOTH essentially leads the planner to use fewer states, less memory. Okay? Dots below the diagonal denote less memory.

All right, so, yeah?

>>: One thing. It seems like the [inaudible] of this approach depends on discovering those basis functions.

>> Andrey Kolobov: Yeah.

>>: I’m wondering, are there settings where discovering those basis functions is harder than in others? For example, you tested on nine different domains. When the variables that you have in your state representation have complex dependencies, is it getting harder to discover those basis functions or not?

Like, [inaudible].

>> Andrey Kolobov: So I don’t know of cases where it’s harder to discover basis functions. So this set of problems that we tested on, I mean, the problems there are synthetic, but they are structurally complex.

And they are structurally complex in different ways. So the one possible kind of scenario where this approach may not do so well is the one that you hinted at where you essentially have each action just has tons, and tons, and tons of effects, right?

So determinizing such actions is very expensive, solving such determinized problems would be very expensive, and also it’s hard to select good basis functions in those cases. But such scenarios are fairly rare in goal-oriented settings.

They do occur in reward-oriented settings; in goal-oriented settings they’re rare, because if an action has a ton of effects, right, and you want to reach the goal, the chances are one of these effects leads directly to the goal.

So the problem becomes, in a sense, simple, simple in this way. But yeah, other than that I don’t really know of really bad scenarios for this.

>>: You then clicked the variables by regressing them by one [inaudible] source.

So existing variables until you have a code that really [inaudible].

>> Andrey Kolobov: So you are suggesting --

>>: You take a subset of the variables, at random, and you compute [inaudible], so that’s a new variable. Then one random [inaudible], and then you compute a set of new variables that present only permission [inaudible]. So it’s a kind of encryption.

>> Andrey Kolobov: Yeah, I see.

>>: So you’ve encrypted the structure of the variables behind some random source. So now it’s going to be quite difficult to find the -- any subset, any basis function which is a subset of variable now corresponds to the [inaudible].

>> Andrey Kolobov: Yeah. I mean, if you start trying to build scenarios adversarially, I mean, sure.

>>: [inaudible]

>> Andrey Kolobov: Yeah. I mean, definitely, it’s just a matter of whether they seem to occur in practice, and that does seem to be the case. But yes, indeed you can construct such problems.

Okay. So the approaches that we’ve considered so far sort of assume that you do have a goal, that the [inaudible] is actually trying to achieve a goal. But in some scenarios, like in portfolio management, you don’t have a particular goal, you just want to manage the portfolio and keep earning money as long as you can, sort of.

So the processes here are essentially reward-oriented. You want to get good long-term reward; maximizing long-term reward is like minimizing, sorry, maximizing minus cost.

And, you know, that’s actually a subtle optimization criterion because when you cared about reaching the goal in some sort of cost optimal way and you had dead ends, really you had a very convenient sort of proxy optimization criterion there.

Clearly if you are in a life-and-death situation and you want to optimize cost, the first thing you care about is getting to the goal. Okay? Staying alive, right? The second is optimizing for cost.

Here there isn’t such a convenient proxy optimization criterion. You really kind of need to solve the reward optimization problem more or less directly. All right?

So in this case, these sorts of goal-oriented basis functions of the types that we just talked about are less applicable.

I mean, in some situations you can identify high-reward regions of the state space, turn these states and regions into goals essentially, and do the kind of regression to obtain basis functions in this way, but in many cases no state is particularly good or bad. Okay?

The distinctions are much more subtle. So you need something completely different here, and I’ve done some research on what are good design principles for these kind of scenarios. I’ve discovered, for instance, it makes much more sense here to plan online, especially if you have huge state spaces and actions with a ton of effects.

Planning online doesn't always pay off in goal-oriented settings, because typically when you plan online you have a very limited amount of time to figure out what to do, which means you typically only plan up to a limited lookahead, right?

In goal-oriented settings this may really bite you, because if you don't see the goal, if you don't have enough time to plan far enough ahead to see the goal, you can fail miserably, right?

And in reward-oriented settings this is less common. Okay? Two planners came out of this research. One of them is called Glutton. The reason it's called Glutton is that it essentially eats up all the memory that you have on your machine right away and uses this memory for some bookkeeping tasks.

It’s not like it runs out of memory, it just manages it in a smart way, okay?

The main idea here is that it's planning online and it does reverse [inaudible]. As long as it has time, it solves the problem for lookahead one, two, three, four, and so on, and it's based on essentially a planner that has a terminating condition, so you can always tell when it has come up with the optimal policy for a given lookahead.

Okay? And when you terminate it in the middle of planning, it tells you, okay, here is the action I think you should choose, and this action is going to be optimal for this kind of lookahead. All right?
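Here is a rough sketch of that anytime, increasing-lookahead idea, written as a generic receding-horizon loop rather than as Glutton's actual implementation; actions, P, and reward are assumed model callbacks, and the naive recursion has no memoization, so it only illustrates the control structure.

    import time

    def online_action(s0, actions, P, reward, time_budget, max_lookahead=50):
        """Pick an action for state s0 by solving progressively deeper
        finite-horizon problems until the time budget is exhausted, keeping
        the recommendation from the deepest fully completed lookahead."""
        deadline = time.time() + time_budget
        best_action = None

        def v(s, depth):
            if depth == 0:
                return 0.0
            return max(q(s, a, depth) for a in actions(s))

        def q(s, a, depth):
            # expected immediate reward plus the value of acting well for depth-1 more steps
            return sum(p * (reward(s, a, t) + v(t, depth - 1)) for t, p in P(s, a))

        for lookahead in range(1, max_lookahead + 1):
            if time.time() > deadline:
                break  # keep the answer from the last completed lookahead
            best_action = max(actions(s0), key=lambda a: q(s0, a, lookahead))
        return best_action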

We submitted it to the International Probabilistic Planning Competition; it's the kind of competition where, well, as its name implies, a bunch of planners compete at solving a wide range of problems. It was a runner-up, a close runner-up. And a further development of it, called Gourmand, was actually able to outperform the state of the art. I mean, it came out a few months after the competition.

It was able to outperform the winner of the International Probabilistic Planning Competition, which makes it the state of the art in reward-oriented settings in some sense. Surprisingly so, because the winner of the competition was a heavily optimized version of UCT, and recently in probabilistic planning there has been this view that if you have a problem with a really complex structure, like if you have many action outcomes, right?

You may know the probabilities of those outcomes, but really you can't use them in any kind of closed-form computation, because at every step you need to take a summation over all the successors of each action in each state, and that's really, really expensive.

So you essentially have to resort to sampling. Okay? Now, UCT seems ideally suited to these scenarios because it lets you do that sampling in an intelligent way.

And surprisingly this method, even though it's based on a more or less conventional probabilistic planning technique, actually beats this finely tuned UCT version across a range of domains.

And the reason it does is that in UCT you have these parameter settings and you really need to tune them well for it to work. Moreover, these parameter settings are very problem-dependent.

So if you want to get a general problem-solving tool out of UCT, the best you can hope for is, well, you just pick a set of parameters that seems to work fine across a range of problems. When you then hit a problem that doesn't conform to what you've tested on, again, you can fail miserably.
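One of the parameters in question is UCT's exploration constant (there are others, such as rollout depth and count). For reference, a minimal sketch of the UCB1 child-selection rule at the heart of UCT is below; the constant C and the node attributes are illustrative assumptions, and the best value of C really does vary from problem to problem.

    import math

    def ucb1_select(children, C=1.4):
        """Pick the child node maximizing average reward plus an exploration
        bonus. Each child is assumed to expose .visits and .total_reward;
        C is the problem-dependent exploration constant discussed above."""
        parent_visits = sum(c.visits for c in children)

        def ucb(c):
            if c.visits == 0:
                return float("inf")  # always try unvisited children first
            exploit = c.total_reward / c.visits
            explore = C * math.sqrt(math.log(parent_visits) / c.visits)
            return exploit + explore

        return max(children, key=ucb)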

Gourmand actually does have some parameters, but it optimizes them as it goes along, it does this automatically, and as a result it adapts to the problem that it's trying to solve, okay?

And because of this, even though UCT was really, really finely tuned, you know, there was a really wide range of problems, and across that wide range of problems automatic parameter tuning of the kind done by Gourmand is much more practical.

Okay? And as I said, it doesn't need much tuning. It gives policy guarantees, some characterization of the policy quality, and if you want to play with it, I'm going to release it soon. I'm going to release the code, so, all right.

So that was essentially most of the talk about scaling up existing planning approaches. Now, a few words about extending the existing models.

And the first thing I have to say is that actually the previous MDP definition I gave you was wrong.

Well, it was incomplete, let's put it this way. Actually, if you look at it, as it turns out, optimizing for expected cost is an ill-defined criterion for that kind of model. And you need to restrict it to make it well defined. Okay?

There are several MDP classes that you've probably heard about: finite-horizon MDPs, discounted-reward MDPs, and stochastic shortest path MDPs, which are essentially attempts to restrict the previous MDP definition to make it well-formed. All right?

But the cost that you pay for making them well-formed is that you -- well, you fail to model some scenarios that we as humans can handle. So in particular, stochastic shortest path MDPs assume that all costs are positive, which essentially means you can't have both costs and rewards in a goal-oriented process.

Well, you just can't model that. And they assume that there are no dead ends, that no bad things can happen while you try to achieve the goal. All right? Clearly this is unrealistic. Here's one scenario where bad things do happen.

If you try to land a damaged airplane, it can obviously crash, and formally you can't really capture this with that model. You need hacks. Okay?

And so to avoid hacks I've done some work on extending the existing MDP classes and coming up with algorithms for them. Now, first of all, why is this hard? Okay? The reason this is hard is that in the existing MDP classes, like stochastic shortest path MDPs or discounted-reward MDPs, the optimal value function, so ultimately the solution of the MDP, can be obtained very easily by fixed-point methods.

So one fixed-point method is called value iteration, and there is policy iteration as well. Some other examples of fixed-point methods are Newton's method, EM, and so on.

Many of the commonly used methods are of this kind. And what they do is they start with some guess of what the solution is, apply some operator to it to improve it, and then invariably converge to a unique fixed point that turns out to be the optimal solution to the problem.
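As a concrete example of such a fixed-point method, here is a minimal value iteration sketch for a goal-oriented MDP; states, actions, P, cost, and goal are assumed inputs in a made-up format, not an interface from the talk. The Bellman backup is applied until the value function stops changing, that is, until an (approximate) fixed point is reached.

    def value_iteration(states, actions, P, cost, goal, eps=1e-6):
        """Illustrative sketch. P[s][a] is a list of (next_state, probability)
        pairs, cost[s][a] is a positive action cost, and goal states are
        absorbing with value 0."""
        V = {s: 0.0 for s in states}  # initial guess of the value function
        while True:
            delta = 0.0
            for s in states:
                if s in goal:
                    continue
                # Bellman backup: best expected cost-to-go over all actions
                backup = min(
                    cost[s][a] + sum(p * V[t] for t, p in P[s][a])
                    for a in actions[s]
                )
                delta = max(delta, abs(backup - V[s]))
                V[s] = backup
            if delta < eps:  # (approximately) at the fixed point
                return V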

As it turns out, for even small extensions of the existing MDP classes this breaks down. The operator used by most of these algorithms, called the Bellman backup, essentially starts having multiple fixed points, only one of which is optimal.

The other fixed points aren't meaningful. So you somehow need to deal with these extraneous fixed points. The other observation is that in methods for solving the conventional MDP classes we could derive the optimal policy from the optimal cost function very easily.

It was a trivial, very efficient procedure. Again, even for small extensions this ceases to be true. You can come up with an optimal value function, but if you try to use the old way of deriving the optimal policy from it, you could get a policy that, for instance, achieves the goal with very low probability or not at all.

And if you have a large problem this is a big issue, because you can't really tell how good the policy is if the problem is large. Really the only way of figuring this out is to try to execute it, perhaps in real-life circumstances. And again, that's something you don't want to do.

So, as recent progress in this direction, I've done some work on extending MDPs to allow a richer cost structure, not just positive costs, for instance. And that's captured in a class called generalized stochastic shortest path MDPs. Also, I came up with a heuristic search algorithm for this class.

So it's an optimal algorithm, but it's more efficient than, say, value iteration in the sense that it typically manages to avoid visiting all the states in the state space. Okay?

And that work led to an extension of stochastic shortest path MDPs that lets you reason about catastrophic scenarios, about dead-end states, effectively, in a decision-theoretically sound way. In particular, the major extension is the one that allows you to reason about dead ends when you want to put an infinite penalty on them.
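As a small aside, a related and simpler setting puts a large but finite penalty D on ever entering a dead end. A minimal sketch of that finite-penalty idea, my own illustration rather than the algorithm from the talk, is to cap every Bellman backup at D, so that states from which the goal is effectively unreachable settle at value D.

    def capped_backup(V, s, actions, P, cost, penalty_D):
        """Bellman backup for a goal-oriented MDP with a finite dead-end
        penalty. Any state whose best expected cost-to-go exceeds penalty_D
        is treated as (no better than) a dead end and assigned value
        penalty_D. All names here are illustrative assumptions."""
        best = min(
            cost[s][a] + sum(p * V[t] for t, p in P[s][a])
            for a in actions[s]
        )
        return min(best, penalty_D)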

Okay? This is something in fact that even we as humans are not so good at dealing with. If we have an environment, which is very dangerous in the sense that reaching the goal involves a significant chance of dying, and dying to us is like an infinite penalty, then we as humans often get lost.

We don't really know what to do, because essentially every single thing that we do carries an inherent danger. And this extension to stochastic shortest path MDPs lets us do this kind of reasoning. And something that I'm currently working on in this direction is stochastic shortest path MDPs that allow both costs and rewards.

Okay? Again, we as humans are actually able to deal with these scenarios. You know, we have goals in life, we have actions that take resources away from us, like shopping, and we have actions that bring resources to us, like working.

And we are able to deal with scenarios like this in ways that current MDP classes can't.

And that's the work that's aiming to fix this. All right. Now a few words about what I want to do moving forward.

Okay? First of all, so there is a whole bunch of applications that I’m interested in and in particular the ones that are being worked on here at Microsoft. One of them is power management.

In particular, power management for computing devices, where I'm hoping to answer questions like: when can we turn off various energy-hungry components of a computing device, like CPUs or screens, without interfering with the user's activities, so as to extend battery life, essentially.

Another kind of question that's also indirectly related to this is how we optimize the operating system so as to minimize the number of cache misses, for instance.

Again, I think that's something decision theory could help a lot with. Right now these questions are essentially being answered by expert engineers who sit down, look at the patterns in the data, and try to come up with heuristic ways of evicting pages from the cache.

And last summer we did some work that essentially suggests we can do much, much better. Okay? Besides the obvious impact of making the devices greener, you know, making them require less energy, extending the battery life will make the devices more attractive to customers.

Okay? It's, for instance, a deciding factor for me personally when I'm selecting between devices. And of course, you know, that will make the Microsoft [inaudible] happy as well.

So again, the key here is probabilistic planning under uncertainty, but also, of course, to take the user into account we'll need a lot of machine learning to actually learn the patterns in user behavior. Another application, again one that's being worked on at Microsoft and in Bing, is crawling.

How do you do crawling on the web in a principled way? The issue here is that there are billions of pages sitting in the index and sitting on the web, right? Some of them, many of them actually, can change. A small fraction actually does change, but many of them can change.

And for some of these pages users really expect you to pick up changes promptly, like changes to news pages, and they expect you to serve the updated results as responses to queries.

The issue is, well, you only have a pipeline of a certain width. You can only crawl so many pages per day, right? And the question is how you optimize crawling so that it picks up the most important changes as quickly as possible, right?

So this scenario can be modeled as an MDP or a (PO)MDP, depending on how you look at it, because intuitively what you want to do is balance two optimization criteria. You want to get the immediate payoff from picking up new updates, but also, as it turns out, there is data about how changes in different pages are correlated, okay?

So data of the kind that tells you if this page has been updated recently, then this page has also been updated recently with some probability. So you not only want to get the immediate reward of crawling pages that have been updated, you also want to gather information about what other pages may have gotten updated. All right?

And so balancing out these optimization criteria, again, can be done with decision theory. And the huge [inaudible] here is that, of course, if you try to model this as an MDP in the most [inaudible] way, the branching factor, the number of actions that you can choose from, is simply enormous, right?

You need to fill the pipeline that you have, say of width N, with some selection of pages from a much larger set of size M, right? So the branching factor is really, really huge.

However, this problem does have lots of structure. In particular, the information-gathering aspect of it may have submodularity structure. Intuitively speaking, as you select more and more pages for crawling, the amount of information you gain about the other pages, the ones you're not actually crawling, shrinks, in the sense that you get less and less added benefit from each page you add to the pipeline.

Okay? And these kinds of structure, along with the work that I've done and work that has been done on applications with similar structure, essentially give me hope that we can do something really interesting and really efficient here.
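As a purely hypothetical sketch of how that diminishing-returns structure might be exploited, one could fill the crawl pipeline greedily; marginal_gain below is an assumed callback that scores how much extra expected benefit a page adds given what has already been selected, and for submodular objectives this kind of greedy selection is known to be a reasonable heuristic.

    def fill_pipeline(candidate_pages, pipeline_width, marginal_gain):
        """Greedily select pages for a crawl pipeline of fixed width.
        marginal_gain(page, selected) should return the extra expected benefit
        of crawling `page` given the list already `selected`; with diminishing
        returns, each added page contributes a little less."""
        selected = []
        remaining = set(candidate_pages)
        while remaining and len(selected) < pipeline_width:
            best = max(remaining, key=lambda pg: marginal_gain(pg, selected))
            selected.append(best)
            remaining.remove(best)
        return selected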

Moreover, there is another problem in crawling that I don't really have time to talk about in detail, which is figuring out which parts of the web graph to crawl in order to pick up new pages.

So where are the new pages most likely to appear? Again, I believe that's a job for probabilistic planning. So, some other things that I'd like to work on. MDP theory. I'm a firm believer that along with the practical applications of MDPs, we also need to figure out whether what we're doing is theoretically sound in some way, even in some approximate way.

One thing I'm interested in is learning to optimize the variance of the policies that MDPs give you, because sometimes you're not only worried about the expected cost or expected reward of a policy, but also about how badly the policy can fail, or how much reward it can give you in the best circumstances.
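A very simple way to look beyond the expectation, sketched here under my own assumptions, is to estimate the mean and variance of a policy's cost by simulation; simulate_cost is an assumed callback that executes the policy once and returns the total cost of that run.

    import statistics

    def policy_cost_statistics(simulate_cost, num_rollouts=1000):
        """Monte Carlo estimate of a policy's expected cost and its variance,
        from repeated simulated executions of the policy."""
        costs = [simulate_cost() for _ in range(num_rollouts)]
        return statistics.mean(costs), statistics.variance(costs)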

Another one is analyzing what makes MDPs hard. In inference, for instance, we have a measure, treewidth, that tells us how hard the problem is. And I'm interested in finding an analog of this for MDPs.

Transfer learning is another. There is a whole range of problems here.

I'm interested in both the theoretical and the practical aspects of it. So transfer learning, first of all, is something you use when you have a domain with a lot of training data, where it's easy to build the model.

You have another domain where building a model is hard because the training data is -- you know, it's very expensive to collect and so on. So what you want to do is somehow transfer the patterns of the model from the source domain, the [inaudible] domain, into the data-scarce domain.

And so far there haven’t been principled ways of doing this, although there has been a lot of work in this field. And I’m interested in how to make this principled and then apply it to fields like medical diagnosis for instance.

Some other work on game theory. How do you make laws more efficient?

These days laws are essentially built in a way that says everybody should do this, everybody should do the same thing. Okay? That kind of corresponds to enforcing pure [inaudible] if you view the situations that laws describe as games.

Okay? So what happens if we try to design laws that let different people do different things? Okay? Again, some rough ideas that I'll be happy to talk to you about. Then, applications in crowdsourcing. What I'm interested in here is how to build workflows automatically. Right now they are typically designed by humans.

And how do we scale crowdsourcing so that it doesn't just perform relatively easy tasks cheaply, but does something larger than life for us? Something that no single human could do in principle.

Okay? And again, I'll be happy to talk to you about any of this. But to conclude, first I'd like to acknowledge the many people that I've worked with throughout my Ph.D. and that I've learned a lot from. So thanks a lot to all of them.

Also, a shameless plug. If you want to know more about the state of the art in probabilistic planning, at least the way [inaudible], my advisor, and I see it, we have recently written a book about it and have also given a tutorial; the slides are on my webpage and on Mausam's webpage as well. So check it out.

And to conclude, in my Ph.D. work I essentially tried to address two big challenges in probabilistic planning. How to scale up methods for solving existing models and how to extend the boundaries of what we can model in principle with current tools.

And moving forward, I'd like to concentrate on both theoretical and practical work, applying probabilistic planning and machine learning to a whole bunch of problems.

Thank you.

[applause]

>>: Are there any other questions about anything that has been said?

>>: What do you think about the [inaudible] of these approaches to partially observable domains?

>> Andrey Kolobov: So in partially observable domains you, I mean, sometimes you have a goal, but often these scenarios are reward-oriented. So I would be leaning more in the direction of using something like Glutton or Gourmand in these settings.

Now, I think another promising direction is actually to use the UCT algorithm. I said this algorithm has a set of parameters that you need to tune, but if you have a concrete problem, a concrete application, I think you can actually tune these parameters.

And UCT does have properties going for it. In particular, you can view solving a (PO)MDP as really just solving an MDP with a continuous state space, right? It's really an MDP that's hard to solve because you can't perform, say, Bellman backups in a very efficient way, okay?

And you do need Bellman backups in MDPs to reason about policies optimally.

Now, this is the same issue that you have in ordinary MDPs that have a complex structure.

And in those MDPs UCT has been applied quite successfully. So if you try using UCT in (PO)MDPs, I think you'll have great success.

>>: What about basis functions? --

>> Andrey Kolobov: You can. So again, in continuous spaces, I mean, there has been some work on designing basis functions automatically. People would take, say, a bunch of Gaussians and try to approximate a value function with those Gaussians. But yes, I mean, you can definitely combine UCT with learning weights for basis functions.

Sure. I think, in fact, some of the competitors in the probabilistic planning competition have done that. It hasn't worked so well quite yet, but, you know, there has been work in that direction.
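For what the Gaussian idea looks like in code, here is a minimal sketch, with the centers, the width, and the sampled (state, value) pairs all being assumptions for illustration: the value function is represented as a weighted sum of Gaussian bumps, and the weights are fit by least squares.

    import numpy as np

    def fit_gaussian_value_function(states, values, centers, width):
        """states: (n, d) array of sampled states; values: length-n array of
        target values; centers: (k, d) array of Gaussian centers. Returns the
        weights w and an approximator
        V(s) ~= sum_j w[j] * exp(-||s - c_j||^2 / (2 * width**2))."""
        centers = np.asarray(centers)

        def features(X):
            d2 = ((np.asarray(X)[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2.0 * width ** 2))

        Phi = features(states)
        w, *_ = np.linalg.lstsq(Phi, np.asarray(values), rcond=None)
        return w, lambda X: features(X) @ w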

Yeah?

>>: In pretty much all of your techniques, I think in all of your techniques, you assume you have a model of the world and that that model is correct; there are no errors in it. You cited a lot of examples where you're seeing that this can be applied in the real world.

There you don't know what the real model is, and so you have this exploration-exploitation issue. What do you think about the prospects for taking some of these techniques and transferring them to that setting?

>> Andrey Kolobov: So again the Gourmand approach that I described, okay?

It's doing something similar to what reinforcement learning approaches do. It essentially does a lot of sampling. Even if it doesn't know the transition probabilities exactly, it assumes that it has some sort of a simulator, the real world for instance, right?

And it simply does sampling. So it doesn't in fact rely on knowing all the parameters of the problem in advance. So you could try it there. But again, another kind of approach that would be directly applicable there is Monte Carlo tree search-based approaches like UCT.

UCT doesn't assume that it knows the model. It just assumes that it has access to a simulator. However, there is one fine point about UCT: it's inherently a reinforcement learning approach, and what it's trying to do is optimize cumulative reward.

So in other words it’s not just trying to optimize the cost for the policy that it’s getting, or reward of the policy. It’s trying to also minimize the cost of learning that policy, right?

In planning you typically don’t care about the cost of learning, right? So UCT can actually be made more efficient than it is in its basic form if you only care about optimizing policy rewards. Yeah.

>>: Let’s stop there. Thanks very much.

[applause]
