Eric Horvitz>>: Okay. It's our honor today to have Eyal Amir with us, visiting from the
University of Illinois at Urbana-Champaign. Eyal is a professor there and he's been there
for several years. Before that he did some post doc work at UC Berkeley, and right
before that he did his Ph.D. at Stanford University working with John McCarthy,
(inaudible) and others.
I came to know Eyal through some of his innovative work, coming out of his Ph.D. work and shortly thereafter, linking logical reasoning with probabilistic methods and issues of separability of factionetic (phonetic) bases. I thought it was very interesting work.
He's gone on to do some more central UAI work since then. He's the winner of a number of awards, including the Arthur L. Samuel Award for the best computer science Ph.D. thesis at Stanford University. That award is often a harbinger of great things to come. I've known some other winners of that award in the past who have gone on to do great things.
Today, Eyal will be talking about two topics; one, reachability under uncertainty, and the
other, Bayesian Inverse Reinforcement Learning.
Eyal Amir>>: Thank you, Eric. Thank you all for coming and actually being in this
room. And also thanks for the virtual audience looking at this talk. I just want to give
you a brief -- before I start with the talk, this is my group circa a couple of years ago.
As always, the work that you're going to see is really my students' and only partially mine. Allen Chang here has now gone into the startup world to make a lot of money, but he is the contributor of half of this talk. And the other contributor is Deepak Ramachandran. And next year he's going to be on the market. So
look forward to seeing him, perhaps, in person.
All right. I'll go now to the topic of this talk. I'm going to talk about two separate topics.
They come from trying to cast traditional problems in a slightly different view. One of
them is the problem of reachability in graphs, and the other one is Inverse Reinforcement
Learning, learning from the actions of experts. And I'm going to try -- the two talks are
slightly separate, but I'm going to try to tie them together towards the end.
All right. Here's the first -- here is the first problem. It's something that we started
working on about a year and a half ago.
It's an interesting, very simple problem. You're trying to find reachability or the shortest path in a graph. We have a source and a sink. In this case, the graph is
directed. You can think about it as an undirected graph as well. You're trying to find the
shortest path, the path that minimizes the sum of costs along the edges of the graph.
All right. So this is a very typical, well known problem. There are various standard
solutions for it. What we're looking at are generalizations of this problem. And here
we're looking at cases where the edges may fail.
So typically, if we have a probability for the edge to fail, then we can say that the path
altogether is reliable, if the product of the probabilities is high. Now, there is a hidden
assumption there. And people have used that assumption over the years, and that is that
the probabilities are independent. So the edges may fail independently. And we want to
drop that assumption.
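To make that baseline concrete, here is a minimal sketch (my own illustration, not code from the talk) of the standard independent-failure model: when edges fail independently, the most reliable path is just a shortest path under minus-log reliabilities, found for instance with Dijkstra's algorithm. The graph and numbers are invented.

```python
import heapq
import math

def most_reliable_path(graph, source, sink):
    """Dijkstra on -log(reliability); only valid when edges fail independently."""
    dist = {source: 0.0}                 # accumulated -log reliability
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == sink:
            break
        if d > dist.get(u, math.inf):
            continue
        for v, reliability in graph.get(u, []):
            nd = d - math.log(reliability)
            if nd < dist.get(v, math.inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [sink], sink
    while node != source:                # walk the predecessor chain back to the source
        node = prev[node]
        path.append(node)
    return list(reversed(path)), math.exp(-dist[sink])

# toy graph: node -> [(neighbor, P(edge survives))]
graph = {'S': [('A', 0.9), ('B', 0.8)], 'A': [('T', 0.7)], 'B': [('T', 0.95)]}
print(most_reliable_path(graph, 'S', 'T'))   # (['S', 'B', 'T'], reliability ~0.76)
```

The rest of this part of the talk is about what breaks when that independence assumption is dropped.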
By the way, the motivation for this was just kind of a discussion that I had with some of
the security people in my department that were looking at finding those paths on the
Internet or any network where the probability of the path failing is the least. And so what
we're going to do is try to make some dependence assumptions about that.
We're going to look at one particular example of a model, and we'll see what we can do
about it.
So I'm going to first present this notion of correlation that I'm going to use in this talk.
And then I'll show why standard probabilistic techniques don't work. I'll show that,
actually, the problem is an NP-hard problem. And I'll show an approximation algorithm
for how to solve that.
So here is a stochastic setting. Let's say that we have this very simple path, and we have
random variables that represent the probability of those edges failing. Now, let's
assume -- we're going to create an extra -- an extra node here. And that will kind of
represent the correlation between those three random variables. Remember, each one of
those variables is the probability that the edge will fail.
Okay. This is a very simple graphical model. To those of you who are familiar with
machine learning, this is a simple naive-Bayes-like model. So when you look at the
distribution -- the distribution that governs the probability of the path being reliable, you
notice that the path is reliable if the marginal here is high. So probability of edge one
reliable, edge two reliable, edge three reliable, that's -- that is the probability -- that's the
marginal probability that the path is, indeed, reliable.
Now, notice that what we do to get to this marginal is really sum out this hidden variable
X. This is the hidden variable that we represent -- we use to represent that correlation.
So the way that we get to that marginal is by summing out the variable X. So here's one
application of this, the application that we started with.
We noticed that links may fail stochastically and routers will fail stochastically, and there
is some correlation between the times that a link or router will fail. If two routers have
the same version of software on them, then they have some -- they will be susceptible to
the same viruses, to the same problems.
Similarly, if certain lines were actually built with the same material and so forth, you can see that those lines will be correlated. And if we don't know whether those routers have the same software, or we are not sure which virus is going to attack the system, or we're not sure about some other elements, those will be the hidden variables that we're going to include in that model.
>>: (Inaudible) same back, and now you're assuming (inaudible).
Eyal Amir>>: Right. I'll flip back to that model. Here I showed this on a single path, but really it's a variable that controls all of the edges in the graph. It's just that I focused on a particular path.
Okay. So this probabilistic model here is a representation of the joint probability that we will look at. So the probability that edge one is reliable, and two and three and four, up to N, and X, is the product of these (indicating). Where here, actually, the probability of edge i has a missing conditional on the slide; it should be conditioned on X.
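As a small illustration of that joint (a sketch of my own, with invented numbers), the path reliability is the marginal obtained by summing out the hidden variable: P(all edges ok) = sum over x of P(x) times the product over edges of P(edge ok | x).

```python
def path_reliability(prior_x, edge_given_x, edges):
    """Marginal P(all edges survive) = sum_x P(x) * prod_i P(e_i ok | x)."""
    total = 0.0
    for x, p_x in prior_x.items():
        prod = 1.0
        for e in edges:
            prod *= edge_given_x[e][x]        # P(edge e survives | X = x)
        total += p_x * prod
    return total

# hypothetical two-valued hidden cause (say, which virus is circulating)
prior_x = {1: 0.5, 2: 0.5}
edge_given_x = {
    'e1': {1: 0.90, 2: 0.40},
    'e2': {1: 0.80, 2: 0.50},
    'e3': {1: 0.85, 2: 0.45},
}
print(path_reliability(prior_x, edge_given_x, ['e1', 'e2', 'e3']))
# 0.5*0.9*0.8*0.85 + 0.5*0.4*0.5*0.45 = 0.351, versus 0.65**3 ~ 0.27 if the edges
# were (wrongly) treated as independent with the same per-edge marginals
```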
All right. So here is another example of the use of such models. Let's say we're trying to look at parsing with weighted finite-state automata. If we have some probability of
transitions, then if those probabilities of the edges are correlated, then we will be able to
use that kind of a model to represent this system.
And also robot navigation. If you're looking at trying to find the best, most reliable path,
and you say -- I'll give you an example from my Monday.
If you're trying to find an airline path that is most likely to be on time, and you're
considering two paths, one through Chicago and the other one through Indianapolis, you
can say: Well, if Chicago has some weather issues, then Indianapolis will probably have
some weather issues with some correlation between them. If the weather in Chicago is
really nice, then the weather in Indianapolis will probably be nice as well.
Here is the problem with this new formulation. I showed it originally to some theory people and they said: Yeah, it should be solvable with some nice techniques. Here's what fails. In traditional algorithms, in traditional problems, the optimal path that we get from the source to the sink also uses subpaths that are optimal.
So the path from S to V is the best path, and if the path from V to T is the best path
between them, then this is the best path that goes through V. Again, maybe there is a
better path that goes somewhere else, but if V was like an island here that we have to go
through, then we could concatenate the two best paths and get the best path overall.
All right. So we could essentially reuse those sub-problems. What happens with the stochastic variant of this problem is that, if I find the optimal path
from S to V, and the optimal path from V to T, that's not necessarily the optimal path
overall. Here is a very simple example, and then I'll give you some intuition for why that
fails.
The example is this: I put some probabilities here. We have one hidden variable that has two values, X equal to one or two, and this is path one (indicating), this is path two (indicating), and this is path three (indicating). And so, just to read this for you, the probability that path one is reliable, given that X equals one, is 0.9.
Right. We have this prior over X here (indicating): the probability that X equals one is a half and X equals two is a half. So when we try to find the best path from
S to V, we compute the marginal over the probability of success of this path. This is just
a nice, simple summing out of X from the -- from the joint for path one. And we do the
same thing for path two and path three.
So let me give you all of this and show you what happened. So this is the best path,
according to the summing out, the probability that path one is reliable is 0.65, the
probability of path two reliable is 0.6. So path one is better. We have only one path here
from V to T; that's path three. So we have to take this one.
Now, if we were going to use Dijkstra's algorithm, or something like that, we would
choose this path from S to V, and then concatenate it with path three. But notice what happens
here. The probability of taking path one to path three is significantly lower than the one
taking path two to path three.
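Here is that counterexample as a few lines of code. The 0.9 entry, the 0.5/0.5 prior and the 0.65/0.6 marginals are from the slide; the remaining conditional values are ones I filled in so that the marginals come out right and the effect shows up.

```python
def marginal(prior_x, *segments):
    """P(all segments survive) = sum_x P(x) * product over segments of P(segment ok | x)."""
    total = 0.0
    for x, p_x in prior_x.items():
        prod = 1.0
        for seg in segments:
            prod *= seg[x]
        total += p_x * prod
    return total

prior_x = {1: 0.5, 2: 0.5}
p1 = {1: 0.90, 2: 0.40}   # S->V via path one: 0.9 given X=1 (slide); 0.4 makes the marginal 0.65
p2 = {1: 0.30, 2: 0.90}   # S->V via path two: chosen so the marginal is 0.60, good when X=2
p3 = {1: 0.10, 2: 0.95}   # the only V->T path: good when X=2, bad when X=1 (my numbers)

print(marginal(prior_x, p1))        # 0.65   -> locally, path one beats path two (0.60)
print(marginal(prior_x, p2))        # 0.60
print(marginal(prior_x, p1, p3))    # 0.235  -> but concatenated with path three it loses
print(marginal(prior_x, p2, p3))    # 0.4425 -> the "worse" prefix gives the better full path
```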
Now, why did that happen? There are two intuitions that I can give to you; one is that,
when you concatenate paths, it is very similar to what people in probabilistic reasoning and machine learning do when they exchange the order of summation and product. By exchanging the order of summation and product, you assume that
these can be essentially concatenated. In this case, that's not true. Algebraically, that's
not always the case, but sometimes it's a good, simplifying assumption that we make to
get things practical.
Here's another intuition for why that doesn't work. Look at what happens when we are trying to see what the probability of failure of the entire path is: we look at the case of X equal to one separately, and the case of X equal to two separately, and then combine them.
Here we first combined the solutions for X equal to one and X equal to two for this path, and then looked at a separate combination for X equal to one and X equal to two for this path. They are different options, they are different cases. This path will be really nice when X equals one. This path will be really nice when X equals two. This one will also be really nice when X equals two, and will be really bad when X equals one.
So overall, taking path one and then path three will almost always be bad: either it's bad in the first segment or it's bad in the second segment. Here, at least for one of the cases of X, it will be good.
Yeah?
>>: (Inaudible.)
Eyal Amir>>: Right.
>>: (Inaudible.)
Eyal Amir>>: Right. But what is the value? Okay. You say, let's compute it for X equal to one?
>>: (Inaudible.)
Eyal Amir>>: Right. But then what? Let's take that suggestion. For X equal to one, we will compute that the optimal path is this. For X equal to two, maybe the optimal path is this -- sorry, the optimal path for X equal to two is this; for X equal to one, this one is better.
Then the question is how to combine those two paths, and it's not clear. You could say, well, maybe one of them is the best. But you can find examples where the optimal paths for either value of X will be as far as you want from the overall optimum; neither of them will be even close to the optimal.
>>: Given the (inaudible), given the distribution, there is nothing better to do. Eventually, you can combine it by (inaudible) all possible hidden variables, but that's the only formulation (inaudible).
Eyal Amir>>: True, but we have a very specific optimization problem here. We're trying
to find the optimal path, the path that maximizes this formula: the sum over X of the probability of X times this product. Okay. So what we tried to do is actually solve this. What
I'm going to show is that there is a way to approximate it.
>>: The probability (inaudible.)
Eyal Amir>>: I know the prior over X.
>>: So why not -- given every possible value of X, compute the shortest path and then amortize it with respect to the probability over X.
Eyal Amir>>: (Inaudible) I still need one path at the end. I find the best path for one
value of X, I find the best path for another value of X. I still am looking for the best
overall path. It gives me the best expectation here for success. So I'm trying to send my
truck from San Francisco to New York?
>>: (Inaudible.)
Eyal Amir>>: Right. Well, right. The full combinatorial solution would be just to try all
of the paths between San Francisco and New York.
>>: They do in (inaudible.)
Eyal Amir>>: Right.
>>: Where you start from the end, go back (inaudible.)
Eyal Amir>>: So that's exactly what fails here. Dynamic programming doesn't work
here. That's exactly the argument. What would dynamic programming do for us? It would find the best solution up to Chicago, and then from Chicago, the best path to New York. That's exactly what fails here.
We tried all kinds of tree-width techniques, all kinds of dynamic programming. It all failed, and I couldn't understand why -- why would it all fail? -- until we actually showed that it's NP-hard. So let's examine why. This gives us an intuition for why it won't work, but now let's see that it is indeed NP-hard. We cannot reuse solutions; that's the most important intuition that we got from this.
So here is a simple reduction. And I see that the colors here are not going to be very
favorable to me, but I'll try. So I'll read the first few lines.
So this is a reduction from string template matching. String template matching is the problem where, given several string templates -- these are strings of ones, zeros and stars, where a star means "don't care" -- you're trying to find a bit string that matches at least K of the templates.
So in this case above, for K equal to two, the answer is yes. Let's see: there is nothing that will match S two and S three, because one and zero here don't agree. But there is something that will match both string template one and string template three. So there is a solution for K equal to two, and there is no solution for K equal to three. That's one standard NP-complete problem.
And what we showed is that we can reduce this problem to our shortest path problem, the stochastic shortest path. What we do here is, for every string template, we're going to have one hidden variable value. And we're going to ask, essentially, which templates does a string match?
So if we're given a string, a bit is zero if the path takes the upper edge and one if it takes the lower edge. And if you give me a path in this graph, then that path corresponds to a string. Now, I'm going to ask how many of the templates that particular string corresponds to. Remember the hidden variable that I have: if I have, say, N templates, it will have N values. And I'm going to essentially ask, what is the path that gives me the highest probability in this graph? So if I have a path, for example, that covers K templates, then the probability here will be proportional to K over N.
>>: (Inaudible) taking out the locally optimal path, and you are over -- I'm assuming that
the distribution over X is uniform, you want something that's, on average, the best that
the matches follow the locally optimal paths?
Eyal Amir>>: Right. Remember, this is not an algorithm, this is just a reduction. If you find a path that matches K templates, then its probability will be K times one over N, because one over N is roughly the probability mass contributed by matching one template -- it's not exactly one over N, but this is the intuition. So if each one of the templates contributes one over N of your probability mass, and you sum those contributions over the templates the path matches, then the problem of matching K templates becomes the problem of finding whether you have a path that reaches or surpasses a certain probability in this graph.
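To make the counting argument concrete, here is a tiny sketch of the reduction as I understand it; the templates, the epsilon, and the brute-force search over bit strings are all mine, for illustration only.

```python
from itertools import product

def path_reliability(bits, templates, eps=0.0):
    """Hidden variable X picks a template uniformly at random. An edge labeled b at
    position j survives with probability 1 if template X allows b there ('*' or b),
    and with probability eps otherwise."""
    n = len(templates)
    total = 0.0
    for tmpl in templates:                       # sum out X
        prod = 1.0
        for b, t in zip(bits, tmpl):
            prod *= 1.0 if t in ('*', b) else eps
        total += prod / n
    return total                                 # equals (#templates matched) / n when eps = 0

templates = ['1*0', '*01', '10*']                # hypothetical string templates
best = max(product('01', repeat=3),
           key=lambda s: path_reliability(s, templates))
print(''.join(best), path_reliability(best, templates))   # a string matching 2 of the 3 templates
```

Finding the most reliable path in the reduction graph is then the same as finding a bit string that matches as many templates as possible, which is the hard question.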
All right. So this is just kind of the intuition for why the problem is hard. This string matching problem can be reduced to SAT -- sorry, SAT can be reduced to the string matching problem, and the string matching problem can be reduced to this. So what you get at the end is a problem that is a simple combinatorial problem.
Notice, by the way, that it is NP-hard in the number of values that we have for the hidden variable. So if we have N templates, it's NP-hard in that N. Actually, I'll write it down because this is an interesting point: it's NP-hard in the number of values of this hidden variable. Okay. So here's an approximation scheme. And what I want to point out to you is that this approximation scheme doesn't really resolve this NP-hardness; I'll explain in a second. So here's the approximation scheme.
We're going to assume for a second that all the probabilities have this property: the minus log of the conditional probability of the edge given the hidden variable value is an integer between zero and some Q minus one, for some Q. So we will have Q possible values for the log probabilities.
And what I'm going to do is create a dynamic program that uses those discrete values as the setup. What we notice first is that the minus log probability that a path is reliable, given a certain value for the hidden variable, is then also an integer between zero and N times Q minus one.
And as a result of that, we're going to use those integers as our islands in this search. So here is the algorithm, essentially. At each node we compute the Pareto-optimal solutions for a similar set of values here. So here, for each C I -- I'll focus on this for a second, because this is really the main ingredient of this dynamic program.
So if we have D values for the hidden variable, then we're going to have this kind of product of all the possible values between zero and Q minus one for each one of those hidden variable values. So for example, I will have, let's say, zero, one, one, five, three, two. These are the entries in that table. Remember, this (indicating) is the minus log of the probability of edge i given a certain value for X; but now, instead of a single edge, the entry is the minus log probability of the best path that reaches this node with this value.
So I said we have here one, two, three, four, five, six -- the size of the domain for X is six. So this is X equal to one, X equal to two, X equal to three, and so forth until X equal to six, for those six values. For each one of those I'm going to say: Give me the best path that comes in with these minus log probabilities. So when X equals one, it gives me zero. When X equals six, it gives me two. Okay.
So we're going to have this matrix, and for each entry here I'm going to compute the best path -- one of the paths that comes in with these minus log probabilities. And that's what I'm going to use for my dynamic program. Here's what happens.
>>: (Inaudible.)
Eyal Amir>>: That's right, it is. The number of things that I will have to evaluate is indeed exponential -- not in the number of nodes in the network, but in the number of values of the hidden variable. So it doesn't really counter this NP-hardness.
>>: (Inaudible.)
Eyal Amir>>: True. Right.
>>: (Inaudible.) I mean, you find the approximation (inaudible.)
Eyal Amir>>: That's right. You immediately see the gap. We were not able to show that
it's hard when the domain size is fixed and we were not able to show an approximation
for the NP-hard problem.
So here is how the algorithm goes. If we have computed the dynamic program up to a certain node, we can then use those cached paths, with those probabilities, to extend to the next node. The important thing is that we're going to actually use all of the paths that came in, all of those that we cached in our dynamic program.
Remember, we cached here -- what was the number? Some polynomial in N raised to the power of D. All right. So this is a lot of combinatorial exploration for the machine learning people in the crowd, but it's essentially an interesting problem
that really doesn't yield to the standard techniques that people use in machine learning.
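A compressed sketch of that dynamic program follows (my own reconstruction, restricted to a DAG processed in topological order to keep it short; the integer cost vectors stand for the rounded minus-log probabilities, one component per hidden-variable value):

```python
import math

def best_reliability(graph, source, sink, prior_x, topo_order):
    """graph[u] = [(v, cost_vec)] with cost_vec[x] = rounded -log P(edge ok | X = x).
    Each node keeps only Pareto-optimal cost vectors; the true objective
    sum_x P(x) * exp(-c[x]) is evaluated only at the sink."""
    table = {u: set() for u in topo_order}
    table[source].add(tuple(0 for _ in prior_x))
    for u in topo_order:
        for v, edge_cost in graph.get(u, []):
            for vec in table[u]:
                cand = tuple(c + e for c, e in zip(vec, edge_cost))
                dominated = any(all(o[i] <= cand[i] for i in range(len(cand)))
                                for o in table[v])
                if not dominated:
                    # drop entries that the new vector dominates, then keep it
                    table[v] = {o for o in table[v]
                                if not all(cand[i] <= o[i] for i in range(len(cand)))}
                    table[v].add(cand)
    return max(sum(p * math.exp(-c) for p, c in zip(prior_x, vec))
               for vec in table[sink])

# tiny DAG; X has two values with a uniform prior, integer -log costs per value of X
graph = {'S': [('A', (0, 2)), ('B', (2, 0))], 'A': [('T', (0, 3))], 'B': [('T', (3, 0))]}
print(best_reliability(graph, 'S', 'T', (0.5, 0.5), ['S', 'A', 'B', 'T']))
```

The number of cached vectors per node is what grows like a polynomial raised to the power of D, which is why this does not contradict the hardness result.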
>>: (Inaudible) dynamic programming needs (inaudible.)
Eyal Amir>>: Yeah. I can show you very simple examples where it fails -- where it fails
badly. I mean, as far as you want from the optimal.
>>: (Inaudible) performance of this?
Eyal Amir>>: Time?
>>: No.
Eyal Amir>>: The approximation? The approximation that you get is essentially this, in time that is proportional to one over epsilon: if opt is the reliability of the optimal solution -- remember, this is the probability that the path is reliable -- you get a path that is more reliable than opt to the power one plus epsilon. Epsilon is really -- I assume that all of my probabilities are in those discrete buckets, right, so you would use that epsilon to round them in some way. Okay.
>>: (Inaudible.)
Eyal Amir>>: Opt is less than one. Remember --
>>: (Inaudible.)
Eyal Amir>>: Oh, I see. One minus --
>>: (Inaudible.)
Eyal Amir>>: One minus epsilon times --
>>: (Inaudible.) Is it the same here?
Eyal Amir>>: I have to say, I don't know. I think -- kind of my intuition is that this is
worse than one minus epsilon times that. But we can sit down afterwards and just write it
down. It's very simple.
>>: Is the probability worse?
Eyal Amir>>: It's worse.
>>: (Inaudible) very small, very low (inaudible.)
>>: The way you labeled the problem, it might be interesting to see more probabilities here, because you have nice results in random graph theory on reachability, based on things like critical values and phase (inaudible). Imagine defining a problem where you have some dependency with some named distributions, and seeing if you can actually prove something about critical values and dependencies and so on. It's sort of an interesting playground for new sets of problems: not necessarily solving reachability directly, but looking at particular values, critical thresholds that might be relevant. It sounds like a very tough area to work in, but it's interesting.
Have you thought about that at all, kind of the results in random graph theory and
reachability, the parameters, and thinking what kind of class of problems can you define
for that community (inaudible) in this model?
Eyal Amir>>: There are so many -- there are different dependence models for different
applications. The answer to your question is no, I haven't gone in that direction. But there
are really different dependency models that one would like to have, like those
dependencies between different roads that I'm going to use through Chicago or different
dependencies of many different random variables that are hidden and affect the system.
>>: (Inaudible) we have some interesting routing problems that might -- depending on -- you have traffic issues that are changing, and depending on your forecast of your waits over time (inaudible) what is reached at each segment when you finally get there. And depending on -- (inaudible) dependency of what will happen, given where you go, that might not be (inaudible) this problem.
Eyal Amir>>: One interesting piece of gossip about this: we presented this at UAI last year, and people asked me why we didn't submit it to a theory conference. My answer was that people in theory were interested in this, but they are more interested in fundamental problems, and this is not a fundamental problem for them, whereas the AI community is more interested in those dependency structures between random variables. The UAI community, or the AI community, is not as good at theory and combinatorial algorithms, but this is a set of problems that are of real use.
>>: (Inaudible) people are having a hard enough time with independence right now.
Eyal Amir>>: Right.
>>: (Inaudible) the complexity.
Eyal Amir>>: All right. So let me just summarize this part, and then I'm going to quickly switch to the second part. And again, I'll try not to get you too tired. By the way, you can still ask me questions -- I love any questions that you have. I just don't want to get you too tired.
So here it is. What we've shown so far is that there is an interesting different model for reachability. One can use different uncertainty models, and it's a very interesting problem where a lot of different techniques come into play. It is NP-complete, but we still have a very important gap here between the hardness, which depends on the size of the domain, and the PTAS that we have, which is exponential in the size of that domain.
All right. The second part of the talk is completely different and more in line with machine learning theory. It's a different problem where we're looking at Inverse Reinforcement Learning and trying to add knowledge to it in the form of a probabilistic prior on top of that learning problem.
Just to explain what Inverse Reinforcement Learning is, here is an example. We have this game -- just an adventure game where a player moves around in a dungeon, collects treasures, and avoids monsters and so forth. If I look at somebody who plays that game, I can learn a little bit about how the person solves the game. And the interesting problem from the perspective of AI is how to really learn there, and what you are learning as the result.
What people typically do is pose the problem as a Markov decision process -- I'm going to describe that to you in a bit -- and more recently people have tried to use learning approaches to learn those models, the Markov decision processes, and that's what we're going to do here.
What we're going to do is we're going to describe a Bayesian model for Inverse
Reinforcement Learning, or learning the model of the expert's choice of actions. And
we're going to use that model to then find the most efficient or the most rewarding set of
decisions in that domain. Our intuition here is that we would like, in the learning process, to incorporate some domain knowledge, something that will help us learn a little better from that teacher or expert, or from what we observe.
Here is the model that we're going to use. It's a very standard model in machine learning: a Markov decision process where we have a set of states, transitions, actions and a discount factor. We assume that the expert will use a certain policy that maximizes his future rewards; the expected future reward is really the sum of the reward from the current state, the next state, the state after that, and so forth, with some discount. So the farther we are in the future, the more discount we will apply. This Gamma here is less than one.
Yeah?
>>: (Inaudible.)
Eyal Amir>>: Right. So our problem here today would be to actually learn the reward
structure.
>>: (Inaudible.)
Eyal Amir>>: Right. We know that the expert uses a reward
function. We're going to assume something about that reward function, and we're going
to try to use those assumptions to learn a reward function that is more likely to succeed or
more likely to be the correct one for the expert.
>>: (Inaudible.) It's okay to assume that people (inaudible), but when you say that you
want to learn a reward that is from an expert, you're saying that the reward is not
something that is directly observable, it is something that you can play with.
Eyal Amir>>: Right. I mean, we did not come up with the Inverse Reinforcement Learning problem. That was originally posed by Andrew Ng and Stuart Russell about 10 years ago. Their motivation was -- in the case of Andrew, Andrew's motivation was: I want to learn a model that will allow me to control, say, a helicopter, or some unmanned aerial vehicle.
I don't get those rewards. I can see only what that expert is doing; I don't know what the reward structure of that expert is. I can pose some rewards, but I really would like to know what rewards that person is getting, so that I can use those rewards in my solution of the problem. There are those cases, and in fact, sometimes
what people do in Inverse Reinforcement Learning is they assume some reward structure
of some sort, either by guessing or by trying to kind of smooth a certain threshold reward
function.
Instead of doing that, what we are trying to do with this work is learn the reward of that expert. We already see the expert doing these actions; we don't know what the reward is, so let's try to learn that reward.
>>: (Inaudible) what are you maximizing? If you get to choose the reward, then I
choose that the reward will always be one million, and then I maximize, right?
Eyal Amir>>: Right. So there will be two optimization problems that we are going to
solve here. One would be to have the kind of minimum distance from the real reward of
the expert. And the other one would be to maximize -- to maximize the expected
rewards. We don't know what those rewards are, right, but we would like to maximize
the expectation of our future rewards.
All right. I'm going to give a few more details and explain that a little more. I'm going to use this Q function that is typically used in Reinforcement Learning. It essentially represents, for a policy pi -- policy here is the way that you choose your actions -- what is the expected reward for a certain state and a certain action that you're going to perform in that state. Now, we're going to modify this Q function and add this extra variable R for the reward. This is a reward function that will be part of our Q function representation. All right.
Remember, what we're going to try to do is learn that R; but if you give me that R, here is the evaluation of that Q function.
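For concreteness, here is a minimal tabular sketch of evaluating that Q function for a given reward function R by value iteration. This is my own illustration; the talk only states the definition on the slide, and the dynamics and rewards below are made up.

```python
import numpy as np

def q_values(T, R, gamma=0.95, iters=500):
    """Q*(s, a, R) for a small tabular MDP.
    T[s, a, s'] = transition probability, R[s] = reward of state s."""
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        Q = R[:, None] + gamma * (T @ V)   # Q[s, a] = R(s) + gamma * sum_s' T[s, a, s'] * V(s')
        V = Q.max(axis=1)
    return Q

# toy 2-state, 2-action MDP with invented dynamics and rewards
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([-1.0, 10.0])
print(q_values(T, R))
```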
All right. So here are our assumptions. These are, by the way, assumptions that were actually made before this work; we are going to relax those assumptions in this presentation.
One is that the expert, X, tries to maximize the accumulated expected reward, and the second is that the expert executes a stationary policy, and that stationary policy was learned by a Reinforcement Learning algorithm. These are assumptions about the expert. Previous work also assumed, by the way, that the expert is doing something optimal. We're going to drop that here: we're going to allow the expert to be suboptimal, and also we're going to have some prior over what the rewards could be.
I'm going to give some examples in a couple of slides. So let's say that we are given this prior over rewards, and now I'm going to analyze what one could do with that prior. I'll just write it down here, for example: the reward of being in a state where you have the gold, or you get the gold, let's say that is the number 10.
And just to give you some intuition, if you have never seen anything like this, the reward for no gold is, say, minus one. (Sorry, this is a little low here; let's see how to raise this.) This is just an example. The reward function is just a function from states times actions to a real number, or it could be just a function from states to a real number; it doesn't matter.
Most of the time you can translate from one model to the other. So, as I said, we have a prior over those rewards. We don't know what the rewards are, but we have a probability distribution that governs that reward function. Then we're going to observe some things. Those observations are the actions that the expert has done in those states: in state one he did action one, in state two he did action two, and so forth.
We're going to assume that, given that reward function, all of those actions are independent. What does that mean? If somebody already gave that reward function to me, then the probability that the expert will choose action one, given state one, will be independent of the probability that the expert will choose action two in state two. Okay. It's a very simplifying assumption.
In some ways it allows you to introduce many different reward structures, many different experts here; it doesn't matter, just assume this. What it allows us to do is break this into a product: the probability of all of my observations, all of the things I observed that the agent has done, is really the product over the separate actions that the agent did.
So here is what this boils down to. For every state and every action, that probability, given the reward, can be represented with this simple exponential representation: some normalizing factor times E to the Alpha times this Q function that I mentioned before.
Remember, we don't know R, but if we were given R, we would be able to compute this Q function. What is this Alpha here? This is some measure of the expertness of the person who does the demo for us. So we allow the person to be wrong; we just kind of measure how often he will be wrong with this Alpha.
Okay. So, as I said, the probability that at state i we choose action i, given R, is independent of the choice of other actions at other states, and so the probability that we will get all of those observations is just proportional to E to the Alpha times the sum of those Q values.
So it really boils down to a nice formula. And what we can do with that is invert that formula the other way around using Bayes' theorem, and now we can compute the probability of R given the observations. That probability is just some normalization factor times this, and now we can include the prior here.
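So the unnormalized log-posterior is just Alpha times the sum of the Q values of the observed state-action pairs, plus the log of the prior. A minimal sketch follows; it assumes the hypothetical q_values helper and the toy T from the earlier sketch are in scope, and it uses a Gaussian prior purely as a stand-in (the talk allows any prior).

```python
import numpy as np

def log_posterior(R, observations, T, alpha=1.0, gamma=0.95):
    """log P(R | O) up to an additive constant:
       alpha * sum_i Q*(s_i, a_i, R)  +  log P(R)."""
    Q = q_values(T, R, gamma)                    # hypothetical helper defined above
    log_likelihood = alpha * sum(Q[s, a] for s, a in observations)
    log_prior = -0.5 * float(np.sum(R ** 2))     # stand-in Gaussian prior over rewards
    return log_likelihood + log_prior

# observations = [(state, action), ...] read off from watching the expert
observations = [(0, 1), (1, 0), (1, 0)]
print(log_posterior(np.array([-1.0, 10.0]), observations, T))
```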
So far I just did very simple manipulation of those formulas. The real question is: what do we get from this, and how do we get the optimal reward? There are two optimization criteria that we're trying to satisfy in different situations. One of them is that we're trying to learn the rewards, so we can use a loss function, either L1, L2 or whatnot, whatever is most useful in your particular case. This is one problem that we would like to solve.
The other problem that we would like to solve, in other cases, is to find the best policy: the policy that minimizes the expected loss according to whatever loss function we use here. You can usually use whatever norm you would like here in this policy loss, the loss for the policy given this reward.
So these are the two optimization problems we would like to solve. So far we just posed a model and said what the optimization criterion is; now we would like to solve it. There is essentially one main solution method and two different cases.
The main solution that we're going to use is an MCMC sampling technique where we essentially evaluate Q-star and then use that to decide which reward function to go to next. And to avoid recomputing this Q-star from scratch, we are going to cache some of the values for it and evaluate Q-star by looking at the previous values -- sorry, by using the previous policy that we used with the previous R.
Now we're going to change R a little bit, and once we change R a little bit, not all of the Q values are going to change. So we can use the previous computation for the chosen policy to estimate the new Q value.
>>: (Inaudible) keep track of (inaudible) policy?
Eyal Amir>>: So let's say for a particular reward function, we now compute the optimal
policy.
>>: (Inaudible.)
Eyal Amir>>: That's right. Now we're going to use MCMC to perturb this reward function a little bit. We're going to choose a different reward function and evaluate the probability that it is the reward function in our model, using the model that I presented. I cannot optimize the reward function given the form that I showed, but I can evaluate it: for every reward function, I can evaluate that probability given the observations that I had.
So I'm going to evaluate the new reward function that I'm proposing, and with some probability, I will change the reward to that structure. Now, the problem is that every time I try an MCMC step, I need to evaluate this Q value. Remember that Q participates in the computation of the probability of R given the observations of the expert's actions; that included an exponential model with a sum of Q values.
So I need to use this (indicating) in order to evaluate this (indicating). And every time I try to make a step in my MCMC, I could spend a lot of time computing that Q function. So to compute that Q function, it's important to try to do something smarter than just brute force computation.
>>: (Inaudible.)
Eyal Amir>>: Say it again.
>>: So are you trying to estimate the reward function given observations, (inaudible) the Q values in terms of (inaudible.)
Eyal Amir>>: Right.
>>: Whether your actor or your tutor has been following a different policy (inaudible) is
there an optimal one?
Eyal Amir>>: You're right. The tutor may have followed his own policy. I don't know what the optimal policy is, and it might not be the tutor's policy. But I'm going to try to estimate the reward function, and I'm going to use that estimate of the reward function later in choosing my optimal policy.
I don't know if his policy is the optimal one, but I'm going to try to estimate R; for the two different optimization problems, I'm going to do a different kind of estimate here. But here, if I'm looking for a most likely R, I'm going to need to change the policy as I go, because the Q value really depends on the policy, and I don't know what the policy of the expert is.
>>: So you're not necessarily --
Eyal Amir>>: I just see --
>>: There's two policy estimation problems here, one is an estimation (inaudible), and then there's another estimation policy?
Eyal Amir>>: No, it won't happen later. I'm not trying to estimate what the policy of the expert is; I just see the actions that the expert did. I'm just going to try to estimate my optimal policy given my prior, because I don't know what the rewards are.
All right. So what happens is we compute Q, and then we use that to compute the probability of the R that we're testing right now. And we are going to use that in a hill-climbing -- it's not really hill climbing, it's an MCMC step -- to get to an estimate of the most likely R.
Some interesting things about this procedure -- I kind of described this procedure really in two words, but what happens with MCMC is that you create samples: you have many steps of this changing of the reward function, and after, say, one thousand steps, I reach some reward function and I take that as a sample. This is one sample. And now I'm going to start my MCMC again, or I'm going to run this chain again and again.
I'm going to run it many times until I generate samples of my reward function. Using
those samples, I can then estimate what is the most likely reward function. I can estimate
also what is the expected -- let's say the policy with the maximum expected sum of future
rewards. So I can use those samples for whatever I want. But in that MCMC step, still
the basic step really generates those samples of rewards.
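A compressed sketch of that sampling loop follows: plain random-walk Metropolis over the reward vector, reusing the hypothetical log_posterior, observations and T from the sketches above. The actual procedure described in the talk also reuses the cached policy between steps, which this sketch skips, and the step size and chain lengths here are arbitrary.

```python
import numpy as np

def sample_rewards(observations, T, n_samples=50, burn_in=1000, step=0.5, alpha=1.0):
    """Random-walk MCMC targeting P(R | O); returns post-burn-in reward samples."""
    rng = np.random.default_rng(0)
    R = np.zeros(T.shape[0])
    current = log_posterior(R, observations, T, alpha)
    samples = []
    for i in range(burn_in + n_samples):
        proposal = R + rng.normal(scale=step, size=R.shape)    # perturb R a little
        cand = log_posterior(proposal, observations, T, alpha)
        if np.log(rng.random()) < cand - current:              # Metropolis accept/reject
            R, current = proposal, cand
        if i >= burn_in:
            samples.append(R.copy())
    return samples   # use these to estimate E[R | O], the most likely R, or the best policy

posterior_mean_R = np.mean(sample_rewards(observations, T), axis=0)
```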
So there are two interesting things that I want to say about the sampling method. One of them is that when we have a uniform prior over the rewards, it turns out that this Markov chain rapidly mixes, so you don't need to make too many steps in order to create those samples for your estimation problem.
Typically, you don't know how fast your Markov chain will mix, so you have to run it and just hope that it will be fine. This is not our result -- I mean, it is our result, but it really derives from somebody else's result on sampling from uniform distributions over convex regions.
So this is the citation that we use. This is nice just because it guarantees for us some rapid convergence of the algorithm. The second thing that I want to show is that, in experiments, even without the uniform prior, this thing converged relatively well, relatively fast, and to solutions that are better than just the standard IRL without such a prior.
So here I compared two versions of Bayesian IRL, one with Q-learning and the other with k-greedy. What you see here -- these two lines are the expected reward loss. This is when we were trying to just learn rewards. This is the standard IRL and these are the two new Bayesian IRLs. Just to explain, the lower the curve, the better it is, because what we are looking for is something that has a zero reward loss.
So what you would --
>>: (Inaudible.)
Eyal Amir>>: Sorry. N is, I think, the size of the domain that we experimented with. We tried it on different -- yeah, the number of states.
And here I want to show you the policy loss -- (off microphone) Sorry. Just a second.
These are the same curves, but now for finding the policy that maximizes the expected sum of future rewards. Again, you don't know what the reward is, but given the expected reward with these priors, this is what we get.
>>: You said that (inaudible) expert is not optimal, right, you said that earlier, that's the outline?
Eyal Amir>>: Right.
>>: So what you're showing here is the difference from the policy of the expert -- whoever the known expert is -- or the optimal policy?
Eyal Amir>>: This is the difference from the optimal policy. Actually, we didn't try it with different Alphas; we tried it with a fixed one. I think it was a relatively high one, but I know we didn't change the Alpha. But this is the difference from the optimal policy in that particular domain. We tried it on two different domains; I think these graphs are of the adventure game example.
I want to show you one more graph here in this adventure game. Here is just one more example of a prior. We used an (inaudible)-model-like prior where, say, when you have the treasure, then you will have high rewards. So this is something we can assume as a prior over the reward. And then we correlate states that are adjacent: if two states are adjacent, they will have a higher reward, supposedly.
We then applied this kind of prior to the adventure game, and you can see that with this nice prior you get, again, a better reward loss compared to our expert that is trying to play the game and knows what it's doing when it plays that game. Again, over there we used, I think, a shaped reward function. So this is just a very simple measure showing how the reward loss was a little better. All right.
So I said I'm going to try to tie it together. I've given you about an hour of talk, so let me just try to summarize what we've shown here in the second part of the talk.
We tried to generalize this Inverse Reinforcement Learning problem and introduced some different priors. The motivation was both to allow us to learn reward functions when
we have some knowledge about the domain, and secondly, if you look at the problem of
Inverse Reinforcement Learning, you usually get relatively little data and you have a
convex domain that has many possible solutions.
To choose one of those solutions over another is a matter of heuristic, unless you have
some prior knowledge over R, and that's what we tried to solve here. The computational
problems here are still pretty big, but for those small domains that we experimented with,
it was manageable. So that's it. If you have any more questions, I'm happy to
answer them. Thank you.
(Applause).
>>: (Inaudible.)
Eyal Amir>>: Right. So we have papers on this. The first one is in UAI -- Uncertainty in AI, 2007 -- and the other one was in IJCAI of last year. This is number one (indicating), this is number two (indicating).
>>: I think -- I'm not sure, but even for the first problem, even if you have only two options, it's already (inaudible)?
Eyal Amir>>: You think that?
>>: I have some kind of proof (inaudible.)
Eyal Amir>>: I would love that. That would be interesting. All right. Thank you very
much, then.